Abstract


In  this  paper  a  new  modeling  methodology  to  characterize  failure  processes  in  Time-Sharing 
systems  due  to  hardware  transients  and  software  errors  is  presented.  The  basic  assumption  made  is 
that  the  instantaneous  failure  rate  of  a  system  resource  can  be  approximated  by  a  deterministic 
function  of  time  plus  a  zero-mean  stationary  Gaussian  process,  both  depending  on  the  usage  of  the 
resource  considered.  The  probability  density  function  of  the  time  to  failure  obtained  under  this 
assumption  has  a  decreasing  hazard  function,  partially  explaining  why  other  decreasing  hazard 
function  densities  such  as  the  Weibull  fit  experimental  data  so  well.  Furthermore,  by  considering  the 
Operating  System  kernel  as  a  system  resource,  this  methodology  sets  the  basis  for  independent 
methods  of  evaluating  the  contribution  of  software  and  hardware  to  system  unreliability.  The 
modeling  methodology  has  been  validated  with  the  analysis  of  a  real  system.  The  predicted  system 
behavior  according  to  this  methodology  is  compared  with  the  predictions  of  other  models  such  as  the 
exponential,  Weibull,  and  periodic  failure  rate.  The  implications  of  this  methodology  are  discussed 
and  some  applications  are  given  in  the  areas  of  Performance/Reliability  modeling, software  reliability 
evaluation,  models  incorporating  permanent  hardware  faults,  policy  optimization,  and  design 
optimization.

Chapter 1
Introduction 

In the early 1950's, the mathematician John von Neumann studied problems in the design and
construction of digital computing machines. In particular, he was interested in the following problem:
assume that one has a collection of connected elements computing and transmitting information (an
automaton) and that each element is subject to eventual malfunction. Can one arrange and organize the
elements so that the output is error free for an arbitrary period of time?

Although  encouraging,  experience  with  digital  computers  in  the  1950’s  had  some  drawbacks.  The 
ENIAC  (Electronic  Numerical  Integrator  and  Computer),  the  first  electronic  digital  computer,  had  been 
operating  since  the  mid  1940’s.  It  had  18,000  electronic  tubes,  each  tube  having  an  expected  life  of 
2,500 hours [Goldstine 72]. Von Neumann later proved the feasibility of a computer with 2,500
vacuum  tubes  and  a  Mean  Time  Between  Failures  of  8  hours,  by  multiplexing  all  interconnections 
14,000  times,  a  requirement  "not  wholly  outside  the  range  of  our  (industrial  or  natural)  experience" 
[Von  Newmann  63]. 

The  fact  is  that  reliability  was  an  overwhelming  concern  for  the  designers  and  users  of  first 
generation  computers.  The  components  used  were  relays,  vacuum  tubes,  and  delay-line  storage 
devices.  All  had  relatively  high  failure  rates  and  were  subject  to  transient  faults.  Hence,  fault-tolerant 
techniques  were  developed  to  cope  with  component  unreliability.  The  use  of  parity  in  memories, 
duplication  or  triplication  and  voting,  instruction  retry,  and  other  hardware  fault  detection 
mechanisms  were  familiar  to  the  designers  of  those  early  computers. 

The development of fault tolerance was interrupted by the rather sudden appearance of
semiconductor circuits and ferrite cores as digital system components. Hardware suddenly became
so "good" that in the 1960's the responsibility for maintaining operation was relegated by default to
the system software. A typical example is the MULTICS system of M.I.T. [Corbato 74].

At present, it is clear that fault tolerance is a desirable attribute of computing systems. The cost
associated with a computer failure in, say, a spacecraft clearly justifies the use of fault-tolerant
techniques. But there are many other, more trivial, applications where unreliability is undesirable. In
transaction processing systems, airline reservation systems, or even general purpose computation
centers, a system failure or "crash" delays users in finishing one or several tasks.
Unreliability of current digital computing systems does not arise from poor quality of the components
but from the continuous growth in system size and complexity. Typical component reliability is
measured in failures per million hours. But the number of components used is usually very large, and
component malfunction (either temporary or permanent) eventually leads to system failures (at least
once a day for typical Time Sharing systems). Furthermore, another factor now appears to be
especially relevant to system reliability: the correctness of the programs managing the use of system
resources, that is, software reliability.

This  thesis  deals  with  reliability  characterization  of  digital  computing  systems.  In  particular,  it  is 
concerned with the behavior that users will observe in their everyday use of Time Sharing systems,
and with quantifying the impact of unreliable behavior at the system level. The approach taken has
been  motivated  mainly  by  the  following  two  facts: 

•  The  desire  to  formulate  a  hardware/software  reliability  prediction  model. 

•  The  good  fit  of  the  Weibull  distribution  to  experimentally  obtained  failure  data  of  several 
computing  systems  [McConnel  79a]. 

These  two  points  deserve  some  more  explanation. 


1.1.  Hardware/Software reliability prediction


The  following  definition  has  been  drafted  by  the  Software  Reliability  Committee  of  the  IEEE 
Reliability Society:

Definition  1:  A  Compatible  Hardware/Software  Reliability  Prediction  Model  is  a 
suitable  interpretation  of  hardware  and  software  mathematical  relationships  for  combined 
computation  so  as  to  make  feasible  prediction  of  system  reliability  [SRC  81]. 


Prediction  models  describe  the  mathematical  relationships  between  certain  system  parameters.  A 
combined  hardware/software  prediction  model  would  allow  the  evaluation  of  the  impact  of  each 
cause  of  unreliability  on  the  observed  behavior  at  the  system  level.  Prediction  models  can  be  used  to 
refine  a  design  before  actually  implementing  it,  or  to  optimize  the  policies  regulating  the  use  of 
systems already operational.

At present, no such modeling methodology is available. As will be described in Chapter 2, current
hardware and software modeling efforts are unconnected, preventing the formulation of a unified
view of system behavior. Perhaps one reason why more fault tolerance is not found in systems
today is this lack of cost/benefit analysis techniques.

1.2.  The Weibull distribution

If a combined hardware/software reliability prediction model is desirable,
the findings about the Weibull distribution are intriguing. After collecting failure data from several
systems, a research group at Carnegie-Mellon University reached the following two conclusions:

•  Hardware  unreliability  is  mainly  due  to  transients  as  opposed  to  permanent  faults 
[Siewiorek  78]. 

•  The Weibull distribution fits experimental data extremely well [McConnel 81].

The  Weibull  distribution  was  originally  presented  by  Prof.  Weibull  in  an  article  dealing  with  fatigue 
resistance  of  steel  [Weibull  51].  Prof.  Weibull's  goal  was  to  find  a  single  distribution  of  wide 
applicability  that  would  comprise  other  distributions  as  special  cases.  The  Probability  Distribution 
Function  (PDF)  of  the  time  to  failure  is  given  in  the  case  of  the  Weibull  distribution  by 

$$P(t_f \le t) = 1 - e^{-(\lambda t)^{\alpha}} \tag{1.1}$$

Note that for $\alpha = 1$ the Weibull distribution becomes an exponential. For $\alpha = 2$ equation (1.1) becomes
the Rayleigh distribution. For $\alpha < 1$ equation (1.1) has a decreasing hazard function, a concept which
will be formally introduced in Chapter 3 but that can be described intuitively as follows. If $h(t)$ is the
hazard function of a system at time $t$, $h(t)\,\Delta t$ is the probability of observing a system
failure in the infinitesimal interval $[t, t + \Delta t)$, given that the system has operated without failure
up to time $t$. For the Weibull distribution,

$$h(t) = \frac{\alpha \lambda}{(\lambda t)^{1-\alpha}} \tag{1.2}$$

which, for $\alpha < 1$, is a decreasing function of time. Similarly, for $\alpha > 1$, $h(t)$ is an increasing function of
$t$. [McConnel 79b] shows that the Weibull distribution with $\alpha < 1$ closely fits the distribution of the time
to failure of digital computing systems. Therefore, the instantaneous probability of observing a failure
in a computing system decreases as the system continues operating.
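The qualitative behavior of the Weibull hazard function can be checked numerically. The following is a minimal Python sketch, assuming equations (1.1) and (1.2) with scale parameter λ and shape parameter α. The function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def weibull_cdf(t, lam, alpha):
    """Probability of failure by time t, as in equation (1.1)."""
    return 1.0 - math.exp(-((lam * t) ** alpha))


def weibull_hazard(t, lam, alpha):
    """Hazard function of equation (1.2); defined for t > 0."""
    return alpha * lam / ((lam * t) ** (1.0 - alpha))


if __name__ == "__main__":
    lam = 0.1  # hypothetical scale parameter (failures per hour)
    for alpha in (0.5, 1.0, 2.0):
        # Evaluate the hazard at three operating times (hours).
        rates = [weibull_hazard(t, lam, alpha) for t in (1.0, 10.0, 100.0)]
        print(f"alpha={alpha}: {rates}")
```

With these values the hazard falls with operating time for α = 0.5, stays constant at λ for α = 1, and rises for α = 2.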

Figure  1-1  illustrates  the  behavior  of  the  decreasing  hazard  function  Weibull.  The  system  is  started 
at time $t_0$ and failures occur at times $t_1, t_2, \ldots$ The system is restarted immediately after each failure.

Figure 1-1: Instantaneous probability of failure as a function of operating
time for a computing system according to the Weibull distribution

The  instantaneous  probability  of  system  failure  rises  after  each  failure  and  decreases  from  then  on 
until the next failure. This behavior is surprising since computers are rarely switched off and therefore do not
experience a warm-up transient period (a possible cause of early failure). Although computer
folklore uses the words "cold start" and "warm start" to describe different software initialization
sequences, software is also believed to be temperature insensitive. Why should computing systems
exhibit  such  behavior?  What  are  the  implications  of  such  behavior?  Should  all  users  of  a  computation 
center  walk  out  of  the  terminal  room  every  time  that  the  system  is  restarted  and  come  back  later  when 
the  probability  of  failure  is  sufficiently  small?  Or  is  this  just  a  mathematical  paradox  irrelevant  to 
system  characterization?  All  these  questions  will  be  answered  in  the  following  chapters. 


1.3.  Organization  of  the  research 

To  quantify  the  impact  of  unreliability  in  a  variety  of  situations,  a  compatible  hardware/software 
reliability  prediction  model  will  be  created.  This  thesis  deals  with  the  formulation  of  such  a  model,  its 
validation, and the main conclusions drawn from its predictions.

Chapter  2  is  an  overview  of  current  techniques  for  reliability  characterization,  causes  of 
unreliability, and existing modeling methods.

In Chapter 3 the proposed modeling methodology is formally developed. The emphasis is on the
generality  of  the  results  obtained.  It  is  expected  that  some  of  these  results  will  be  applicable  to  the 
characterization  of  the  reliability  of  other  complex  systems  besides  digital  computers. 

The results presented in Chapter 3 are specialized in Chapter 4, where a study of systems under
constant  or  periodic  workload  is  made. 

Chapter 5 discusses the problems associated with model validation. A real system is modeled in
detail,  and  the  necessary  techniques  for  measuring  and  estimating  the  required  model  parameters  are 
also  given. 

Chapter  6  contains  a  comprehensive  study  of  the  similarities  and  differences  between  the  model 
presented  in  this  thesis  and  other  modeling  efforts.  Numerical  comparisons  between  predicted  and 
observed  behavior  are  also  given. 

Chapter  7  contains  some  applications  derived  from  the  new  modeling  methodology.  Several 
examples  are  given  which  show  how  to  use  the  model  in  order  to  optimize  operational  policies  or 
quantify the impact of unreliability on the performance of a digital computer system. Although
throughout the thesis the emphasis is on characterizing unreliability due to hardware transients and
incorrect  software,  an  extension  of  the  modeling  methods  incorporating  permanent  hardware  faults  is 
also  given  in  this  Chapter. 

Finally, Chapter 8 summarizes the results of the previous chapters and suggests directions for
further research.


Chapter  2 
Background 

The  different  approaches  traditionally  used  to  characterize  system  reliability  will  be  examined  in 
this  chapter.  After  some  definitions  given  in  Section  2.1,  the  problem  of  characterizing  computing 
systems  reliability  is  introduced  in  Section  2.2.  Sections  2.3  and  2.4  describe  the  two  major 
contributions  to  system  unreliability:  hardware  transients  and  incorrect  software.  The  modeling 
methodology presented in this thesis will be validated with a TOPS-10 Operating System. Section 2.5
presents the validation requirements. Amid the description of MULTICS-like protection mechanisms
(of which TOPS-10 is an example), the new modeling methodology will be presented.

2.1.  Definitions

The  reliability  of  a  system  is  a  measure  of  how  successfully  a  system  conforms  to  some 
specification  of  its  behavior.  A  failure  is  any  deviation  of  system  behavior  from  its  specifications. 
System  specifications  usually  define  the  external  state  of  the  system,  and  failures  will  be  detected  as 
anomalous  external  states.  The  following  definitions  apply  only  to  operational  systems,  not  to 
systems  undergoing  development,  debugging,  or  testing  (this  distinction  is  important  in  the  case  of 
software  reliability  modeling).  These  definitions  were  first  given  by  [Randell  78]. 

Definition 1: An error is that part of the (internal) system state which is incorrect in the
sense  that  further  processing  within  the  specifications  of  use  will  lead  to  a  failure. 

Definition  2:  A  fault  is  the  electrical,  mechanical,  or  algorithmic  cause  of  an  error.  A 
potential  fault  is  a  fault  that  under  some  circumstances  within  the  specifications  of  use  will 
cause  an  error. 

Definition  3:  A  permanent  hardware  fault  is  an  irreversible  electrical  or  mechanical 
cause  of  errors.  The  internal  state  of  a  system  in  the  presence  of  permanent  hardware 
faults  is  continuously  incorrect. 

Definition  4:  A  transient  hardware  fault  is  a  fault  due  to  temporary  environmental, 
mechanical, or electrical conditions.

Definition 5: A software fault is an algorithmic cause of errors.

Note that software and transient hardware faults are always potential faults since they will lead to
errors only under certain circumstances of use. For operational systems, transient hardware faults
and software faults are indistinguishable in the sense that both produce irreproducible errors.

Definition  6:  A  system  failure  is  the  external  state  manifestation  of  an  error  such  that 
the  entire  computing  system  has  to  stop  operating. 

Since  no  repair  takes  place  after  system  failures  due  to  software  faults  or  transient  hardware  faults, 
the time of system failure is essentially equal to the system restart time. This thesis is concerned solely
with modeling hardware transient faults and software faults. Thus, the words system failure and system
restart  will  be  used  interchangeably  to  describe  the  same  event  in  time. 


2.2.  The  problem  of  characterizing  system  reliability 

Fault-tolerance  has  traditionally  been  characterized  by  relatively  simple  functions  based  on  strict 
assumptions.  The  Reliability  function  R(t)  is  defined  as  the  probability  of  uninterrupted  operation  up  to 
time  t  given  that  all  hardware  was  correctly  operating  at  time  t  =  0.  R(t)  may  be  used  to  characterize 
either permanent or transient faults. The usual assumption is made that the failure rate is constant
and, for nonredundant systems, the reliability function becomes $e^{-\lambda t}$, where $\lambda$ is the sum of the
failure rates of all the components in the system. A very common quantitative measure is the Mean
Time To Failure (MTTF)

$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt \tag{2.1}$$


The  popularity  of  the  MTTF  stems  mainly  from  the  fact  that,  for  nonredundant  systems,  it  is  easily 
estimated  by  dividing  the  time  a  system  is  operational  by  the  number  of  failures  reported.  Other 
reliability indices, used to compare two systems A and B, are the Reliability Improvement Factor (RIF)
[Anderson 67]

$$\mathrm{RIF} = \frac{1 - R_A(t)}{1 - R_B(t)} \tag{2.2}$$

and the Mission Time Improvement Factor (MTIF) [Bouricious 69]

$$\mathrm{MTIF} = \frac{T_B}{T_A} \quad \text{when } R_A(T_A) = R_B(T_B) = R_0 \tag{2.3}$$


which  are  useful  only  when  the  system  under  study  must  be  available  for  a  predetermined  period  of 
time  T  called  "mission  time". 

The  concept  of  coverage  [Bouricious  69]  is  defined  as  the  conditional  probability  of  successful 
recovery,  given  that  a  fault  has  occurred.  Although  mathematically  attractive,  coverage  has  proven  to 
be very difficult to estimate for real systems. Finally, if the Mean Time To Repair (MTTR) is also
known, an estimate of the system usefulness is given by the Availability, which for nonredundant systems
is

$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} \tag{2.4}$$
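The indices of this section are simple enough to compute directly. The sketch below is a minimal Python illustration, assuming a constant failure rate for the reliability function and the definitions of the MTTF estimate, the RIF, and the Availability given above. All function names and sample figures are hypothetical.

```python
import math


def reliability_exp(t, lam):
    """R(t) = exp(-lam * t) for a nonredundant system with constant failure rate."""
    return math.exp(-lam * t)


def mttf_estimate(operational_hours, failures):
    """MTTF estimate: operational time divided by the number of failures reported."""
    return operational_hours / failures


def rif(r_a, r_b):
    """Reliability Improvement Factor of equation (2.2), at a common time t."""
    return (1.0 - r_a) / (1.0 - r_b)


def availability(mttf, mttr):
    """Availability of equation (2.4) for a nonredundant system."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical log: 100 failures over 1000 hours of operation.
    mttf = mttf_estimate(1000.0, 100)
    print("MTTF (hours):", mttf)
    # Hypothetical restart time of half an hour.
    print("Availability:", availability(mttf, 0.5))
    # Hypothetical reliabilities of systems A and B at the same time t.
    print("RIF:", rif(0.9, 0.99))
```

For a constant failure rate, integrating R(t) in equation (2.1) gives MTTF = 1/λ, which is why the simple ratio of operating time to failure count is a natural estimate.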


2.2.1.  Performance-Reliability evaluation

The  above  measures  do  not  take  into  account  the  performance  of  the  system  whose  reliability  is 
being  measured.  Consider  Table  2-1  which  lists  the  results  obtained  from  seven  different  experiments 
whose goal was explicitly to gain experience on systems reliability. Data for the first system [Yourdon 72]
was obtained from a summary of failure statistics on a Burroughs 5500 over a 15 month period
starting  in  April  of  1969.  Limited  information  about  the  cause  of  each  failure  is  available.  For  instance, 
one  of  the  categories  includes  system  failures  due  to  unexpected  I/O  intercepts.  These  failures  are 
recorded  whenever  the  software  responds  to  an  interrupt  signifying  that  some  I/O  action  has  taken 
place,  but  discovers  that  it  has  no  record  of  having  initiated  such  action.  It  is  thus  an  indication  of 
some form of hardware or software error but the particular cause for the failure (hardware or
software) remains unknown. The data for the second system was reported in [Lynch 75] and comes
from  the  first  thirteen  months  in  the  life  of  an  operating  system  called  Chi/OS  for  the  Univac  1108 
developed  by  the  Chi  Corporation  between  1970  and  1973.  No  explanation  is  given  about  how  such 
an  accurate  decomposition  of  failures  due  to  hardware  and  software  could  be  obtained.  [Reynolds 
75]  reports  three  years  of  data  obtained  from  a  dual  IBM  370/165  at  Hughes  Aircraft  Company 
installed  to  handle  a  mixed  batch  and  time  sharing  load.  The  fourth  system  is  at  the  Stanford  Linear 
Accelerator  Center  (SLAC)  where  the  main  workload  is  processed  as  multi-stream  background  batch. 
The  system  consists  of  a  foreground  host  (IBM  370/168)  and  two  background  batch  servers  (IBM 
370/168  and  IBM  360/91).  The  architecture  is  designed  to  be  highly  available  and  reconfigurable. 
The fifth system is the CMU-10A, an ECL PDP-10 used in the Computer Science Department at
Carnegie-Mellon  University.  The  data  for  the  CRAY-1  was  reported  in  [Keller  76],  and  the  data  for  the 
three  generic  UNIVAC  systems  was  reported  in  [Siewiorek  80]. 

Table  2-1  gives,  when  available,  a  Mean  Time  to  restart  (MTTS)  value  in  hours  (that  is,  the  Mean 
Time  to  System  Failure),  a  Mean  Number  of  Instructions  to  Restart  (MNIR)  which  is  an  estimate  of  the 
mean  number  of  instructions  executed  from  system  start  up  until  system  failure,  and  the  percentages 
of  system  failures  that  were  caused  by  hardware  faults,  software  faults,  and  faults  whose  cause  could 
not  be  resolved.  The  information  about  execution  rates  needed  to  compute  the  MNIR  value  was 
obtained  from  [Phister  79]. 

Obviously, the figures shown in Table 2-1 do not by themselves carry much information. An MTTS figure alone does
not tell the impact of unreliability on system use. Compare for example the CRAY-1 [Russell 78] with
the CMU-10A [Bell 78]. Although the CRAY-1 crashes more than twice as often as the CMU-10A, it can operate
continuously at rates above 138 Million Instructions Per Second (MIPS), while the CMU-10A operates
at 1.2 MIPS. Hence the CMU-10A executes ~$10^{10}$ instructions between crashes while the CRAY-1
executes ~$10^{12}$ instructions between crashes. Inconsistencies like this one suggest that reliability
modeling and measurement should be closely tied to the characterization of the performance of the
system under study.
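The instruction counts quoted above follow from multiplying the MTTS by the execution rate. The following minimal Python sketch reproduces that arithmetic for the two systems compared in the text; the function name `mnir` and the dictionary layout are illustrative choices, while the MTTS and MIPS figures are the ones quoted above.

```python
SECONDS_PER_HOUR = 3600


def mnir(mtts_hours, mips):
    """Mean Number of Instructions to Restart: MTTS times the execution rate."""
    return mtts_hours * SECONDS_PER_HOUR * mips * 1e6


# (MTTS in hours, execution rate in MIPS) as quoted in the text.
systems = {
    "CMU-10A": (10, 1.2),
    "CRAY-1": (4, 138),
}

if __name__ == "__main__":
    for name, (mtts, rate) in systems.items():
        print(f"{name}: {mnir(mtts, rate):.2e} instructions between crashes")
```

The CMU-10A comes out at about 4.3 x 10^10 instructions and the CRAY-1 at about 2 x 10^12, so the machine that crashes more often still completes far more work between crashes.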

System                MTTS (hours)   MNIR           % HW    % SW    % Unknown
B 5500                14.7           2.6 x 10^10    39.3    8.1     52.6
Chi/OS Univac 1108    17             6.7 x 10^10    45      55      -
dual 370/165          8.86           2.8 x 10^11    65      32      3
SLAC                  20.2           2.3 x 10^11    73.3    21.6    5.1
CMU-10A               10             4.3 x 10^10    -       -       -
CRAY-1                4              1.9 x 10^12    -       -       -
UNIVAC (Large)        -              -              51      42      7
UNIVAC (Medium)       -              -              57      41      2
UNIVAC (Small)        -              -              88      9       3

Table 2-1: Reliability experience of several commercial systems. MTTS is
the Mean Time to restart. MNIR is the Mean Number of Instructions to
Restart.

Integrated performance-reliability models have already started to appear in the literature. In [Meyer
79], a performance measure called "performability" gives the probability that a system performs at
different levels of "accomplishment". In [Gay 79], systems are modeled with Markov processes in
order to estimate the probability of being in one of several capacity states. This is a similar approach
to the one previously taken in [Beaudry 78], where the concept of "computation reliability" was
introduced as a measure which takes into account the computation capacity of a system in each
possible operational state. A Performance/Availability model for gracefully degrading systems with
critically shared resources is given in [Chou 80]. Finally, in [Moreira 80] a model is described which
predicts the cost reduction associated with different values of coverage, repair rate, and diagnosis
time. An example shows how the advantage of using an N-redundant system can be quantified
assuming only permanent hardware faults.

2.2.2.  Causes of unreliability

Most of the above models have been developed mainly for hard failures, that is, stable failures that
reflect an irreversible physical change in the hardware. Unfortunately, as has been repeatedly
reported ([Fuller 78], [McConnel 79b], [Morganti 78], [Siewiorek 78], [Ohm 79]), transient failures
occur at least an order of magnitude more often than hard failures. A cost-effective analysis should
therefore consider transients as the main reason for system unreliability.

Simultaneously  with  the  developments  described  above,  qualitative  relationships  between 
workload  and  unreliability  have  also  been  noted.  The  results  published  in  [Beaudry  79]  suggest  a 
strong dependency between workload and reliability of digital computing systems. In the paper
by [Butner 80], this dependency is stated explicitly: a periodic, workload-dependent
failure rate is claimed to be more appropriate for characterizing the reliability of time sharing systems than the
classical constant failure rate model traditionally used. As reported in [Castillo 80a], if such a
dependency  is  taken  into  account  it  is  possible  to  characterize  the  performance  of  digital  computing 
systems  considering  reliability  as  an  inherent  attribute. 

Another  factor  affecting  computing  systems  reliability  is  the  reliability  of  the  software  managing  the 
use  of  system  resources.  Faults  in  the  software  which  force  a  crash  and  restart  operation  are  not 
uncommon.  They  occur,  in  most  commercial  systems,  much  more  often  than  permanent  hardware 
faults,  and  their  effects  are  similar  to  the  effects  of  hardware  transient  faults.  The  conditions  under 
which a software fault generates an error are usually impossible to determine (as soon as these
conditions  are  determined,  the  software  can  be  corrected).  Hence,  the  software  faults  remaining  in  an 
operational  system  are  obscure  and  manifest  themselves  only  upon  particular  (but  unknown) 
conditions. 

In summary, the problems currently relevant to computer reliability characterization are:

1.  Predominance of hardware transient faults over permanent hardware faults.

2.  Software unreliability.

3.  Disconnection  between  reliability  evaluation  and  performance  evaluation. 

The  following  sections  elaborate  upon  these  issues  in  more  detail. 


2.3.  Hardware  transient  faults 

Hardware  transient  faults  are  induced  by  temporary  environmental,  electrical,  or  mechanical 
conditions.  Their  effects  include  flipping  a  single  bit  in  the  main  memory  of  a  computer  (due  to  the 
emission of an alpha particle by radioactive elements present in IC packaging), reading erroneous
information  from  a  magnetic  disk  (due  to  inaccurate  positioning  of  the  reading  heads),  resetting  all 
CPU  registers  (due  to  a  power  glitch),  or  receiving  erroneous  information  from  a  bus  (due  to 
electromagnetic  radiation  received  by  a  bus  acting  like  an  antenna). 

Although  a  given  transient  may  occur  more  than  once  in  the  lifetime  of  a  computer,  its  effects  are 
essentially  unpredictable.  Consider  a  single  bit  in  memory  that  flips  its  value  due  to  the  emission  of  an 
alpha  particle.  If  that  bit  is  not  storing  information  at  the  time  that  the  transient  occurs,  and  the  value 
of  the  bit  is  overwritten  before  being  read,  the  transient  passes  unnoticed.  However,  if  the  same  bit  is 
part of a pointer to one of the operating system's critical data structures, the entire computer system
may  crash. 
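The two outcomes described above can be mimicked with a toy memory model. The following Python sketch is only an illustration under stated assumptions: the dictionary `memory`, the cell names, the stored values, and the function `transient_is_visible` are all hypothetical and do not describe any real operating system.

```python
def flip_bit(word, bit):
    """Return the word with one bit inverted, modeling a single-bit transient."""
    return word ^ (1 << bit)


def transient_is_visible(scenario):
    """Return True if the flipped bit ends up affecting a value that is used."""
    memory = {"scratch": 0x0000, "kernel_ptr": 0x4A10}  # hypothetical contents
    expected_ptr = memory["kernel_ptr"]

    if scenario == "overwritten":
        # The transient hits a cell that is rewritten before it is ever read.
        memory["scratch"] = flip_bit(memory["scratch"], 3)
        memory["scratch"] = 0x1234
        return memory["scratch"] != 0x1234

    if scenario == "pointer":
        # The transient hits a pointer that the system later dereferences.
        memory["kernel_ptr"] = flip_bit(memory["kernel_ptr"], 3)
        return memory["kernel_ptr"] != expected_ptr

    raise ValueError(f"unknown scenario: {scenario}")


if __name__ == "__main__":
    for scenario in ("overwritten", "pointer"):
        print(scenario, "->", transient_is_visible(scenario))
```

The same physical event is harmless in the first scenario and corrupts a critical value in the second, which is why the effect of a transient depends on how the affected resource is being used.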

The physical processes generating transient faults are presumably sparse since,
according to Table 2-1, many commercially available systems are capable of continued correct operation
for several hours in spite of being built out of a large number of components. Nevertheless,
sparseness  is  not  equivalent  to  total  absence  and  as  computing  systems  become  more  complex  the 
impact  of  transient  faults  may  become  harder  to  evaluate  unless  due  attention  is  paid  to  this  problem 
during  the  design  process. 

2.3.1.  Causes  of  hardware  transient  faults 

Some causes of transients are:

•  Limitations  in  the  accuracy  of  electromechanical  devices  (such  as  the  positioning 
servomechanism  for  the  reading  heads  of  a  disk  drive). 

•  Electromagnetic  radiation  received  by  interconnections  (such  as  long  buses  acting  like 
receiving  antennas). 

•  Power  fluctuations  or  glitches  not  properly  filtered  by  the  power  supply. 

•  Effects of ionizing radiation on semiconductor devices.

This last cause is currently the most important challenge to device designers and requires some
more explanation. Only recently have the effects of ionizing radiation been recognized
as a source of "soft" faults in computer memories [May 79]. "Soft" means that the information held
in  a  memory  device  has  changed,  but  no  irreversible  change  in  the  device  has  occurred.  Information 
in  computer  memories  is  stored  as  the  presence  or  absence  of  charge  in  capacitors.  When  an 
energetic  particle  creates  electron-hole  pairs  in  the  vicinity  of  a  capacitor,  some  of  the  added  charge 
carriers  are  collected  by  the  capacitor.  If  the  added  charge  is  sufficiently  large,  the  information  stored 
is  changed.  The  amount  of  charge  that  represents  a  bit  and  the  "critical"  charge  that  is  needed  to 
change  it  have  decreased  with  miniaturization  and  the  advent  of  VLSI  technology.  Transient  failures 
in  semiconductor  memories  due  to  ionizing  radiation  were  not  significant  until  the  introduction  of  16K 
bit  and  64K  bit  memory  chips. 

Two main sources of ionizing radiation affecting the
operation of digital computers have been identified so far:

•  Trace  amounts  of  natural  radioactive  elements  in  metallic  and  ceramic  packaging 
materials  [Geilhufe  79] 

•  The  effect  of  cosmic  rays  [Ziegler  79] 

Although  the  effect  of  radioactive  materials  in  packaging  materials  can  be  reduced  by  further 
purification  and  better  system  design,  it  is  not  clear  how  the  effects  of  cosmic  rays  can  be  avoided 
[Keyes  81]. 

Soft errors are sparse. The designer of a 16 bit 1M word system (built out of 16K dynamic RAM
chips) would observe a Mean Time To Soft Failure due to alpha particles of ~40 days [Geilhufe 79].
For a system of the same size (built out of 64K dynamic RAM chips) the Mean Time To Soft Failure
due to cosmic rays would be ~16 days at sea level and ~4 hours at 30,000 feet [Ziegler 79]. However,
a soft error is completely removed by the following write cycle. Thus, as pointed out in [Smith 81], the
observed soft failure rate depends on the frequency of writes or rewrites.

2.3.2.  Hardware  transient  faults  modeling 

There  are  three  basic  approaches  to  hardware  transient  modeling.  In  each  approach,  the 
Probability  Distribution  Function  (PDF)  of  the  time  to  Failure  is  assumed  to  be  either  an  Exponential 
distribution,  a  Weibull  distribution,  or  an  Exponential  distribution  with  periodic  failure  rate. 

2.3.2.1.  Exponential distribution


The most widely used model for failure process characterization assumes the failure process to be
a homogeneous Poisson process. The PDF of the time to failure is then given by

$$P_e(t < \tau) = 1 - e^{-\lambda_e \tau} \tag{2.5}$$

where $\lambda_e$ is the (constant) failure rate. The maximum likelihood estimate of $\lambda_e$ is obtained simply by
dividing the number of failures reported by the time that the system has been operational. All
functions and parameters related to this model will be noted with subindex "e" and from now on this
model will be referred to as the exponential model.
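
As a minimal sketch of the exponential model, the following Python fragment evaluates Equation (2.5) and estimates the constant failure rate from an observation log. The function names and the sample figures (10 failures in 2000 operational hours) are illustrative assumptions, not data from the systems studied in this thesis.

```python
import math


def exponential_cdf(tau: float, rate: float) -> float:
    """P_e(t < tau) = 1 - exp(-rate * tau), as in Equation (2.5)."""
    if tau <= 0.0:
        return 0.0
    return 1.0 - math.exp(-rate * tau)


def estimate_rate(num_failures: int, operational_time: float) -> float:
    """Maximum likelihood estimate of a constant failure rate (failures per unit time)."""
    if operational_time <= 0.0:
        raise ValueError("operational_time must be positive")
    return num_failures / operational_time


# Hypothetical log: 10 failures observed over 2000 operational hours.
rate_e = estimate_rate(10, 2000.0)                # failures per hour
mttf_e = 1.0 / rate_e                             # mean time to failure, in hours
p_fail_one_day = exponential_cdf(24.0, rate_e)    # probability of a failure within 24 hours
```

Under this model the mean time to failure is simply the reciprocal of the estimated rate, which is why a single figure is enough to parametrize it.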


2.3.2.2.  Weibull distribution


Empirical studies [McConnel 79b] have shown that a Weibull distribution gives a better goodness of
fit to experimental data than a simple exponential. The Weibull PDF is given by

$$P_w(t < \tau) = 1 - e^{-(\lambda_w \tau)^{\alpha_w}} \tag{2.6}$$

The Weibull distribution is characterized by two parameters: $\lambda_w$, the scale parameter, and $\alpha_w$, the
shape parameter. For $\alpha_w = 1$, the Weibull distribution degenerates to the exponential. For $\alpha_w > 1$, the
Weibull distribution has an increasing failure rate. A decreasing failure rate corresponds to $\alpha_w < 1$. All
reports published to date claim that a decreasing failure rate Weibull distribution fits experimental
data much better than a simple exponential model. Numerical procedures have been developed to
find the maximum likelihood estimates of $\lambda_w$ and $\alpha_w$. These procedures are based on the works of
[Thoman 69, Berger 74, Romano 77] and FORTRAN programs implementing them are given in
[McConnel 79a].
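
The effect of the shape parameter on the failure rate can be checked numerically. The sketch below assumes that Equation (2.6) has the usual two-parameter form, one minus the exponential of minus (scale times time) raised to the shape, and derives the hazard function from it. The parameter values and function names are illustrative assumptions.

```python
import math


def weibull_cdf(tau: float, scale: float, shape: float) -> float:
    """P_w(t < tau) = 1 - exp(-(scale * tau) ** shape), the assumed form of Equation (2.6)."""
    if tau <= 0.0:
        return 0.0
    return 1.0 - math.exp(-((scale * tau) ** shape))


def weibull_hazard(tau: float, scale: float, shape: float) -> float:
    """Instantaneous failure rate, density / (1 - CDF), valid for tau > 0."""
    return shape * scale * (scale * tau) ** (shape - 1.0)


# Illustrative parameters only: scale of 0.01 per hour and three shape values.
for shape in (0.5, 1.0, 2.0):
    h_early = weibull_hazard(10.0, 0.01, shape)
    h_late = weibull_hazard(100.0, 0.01, shape)
    if h_late < h_early:
        trend = "decreasing"
    elif h_late > h_early:
        trend = "increasing"
    else:
        trend = "constant"
    print(f"shape={shape}: failure rate is {trend}")
```

With a shape of 1 the hazard reduces to the scale parameter itself, which is the exponential model.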


2.3.2.3.  Exponential distribution with periodic failure rate

A workload dependent model has been presented in [Butner 80]. A linear or quadratic dependency
between failure rate and workload is also assumed. The workload is characterized by a periodic
function of time. The proposed PDF becomes an exponential "modulated" by a periodic function

$$P_p(t < \tau) = 1 - e^{-K_p \tau}\, e^{-F_p U_p(\tau)} \tag{2.7}$$

where $F_p$ is defined as the load induced failure rate, $U_p(\tau)$ denotes the instantaneous load value, and
$K_p$ is a workload independent failure rate. This model will be referred to as the periodic model, all its
parameters having the subindex "p".

Although it has not been used for reliability characterization, a periodic failure rate has been also
assumed in a model to determine the optimum checkpointing interval in a transaction processing
system by [Chandy 75a]. The assumptions in this model are that transaction processing systems often
operate  under  periodic  demand,  leading  to  a  periodic  failure  rate.  The  optimum  checkpointing 
interval  is  determined  such  that  the  cost  associated  both  with  checkpointing  and  recovering  from 
failures  is  minimized. 

For systems operating under periodic workload an alternative approach is to break a period into M
discrete intervals. The system workload and failure rate are assumed to be constant in each interval.
This approach has been used by [Chandy 75b] to evaluate the optimum checkpointing interval in
transaction processing systems, and by [Beaudry 80] to characterize the reliability of a multiprocessor
for avionics applications and the reliability of the SLAC system described in Section 2.2.1.
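
The discretized approach can be sketched as follows. The fragment assumes M equal-width intervals per period and integrates the piecewise-constant failure rate to obtain the probability of failure before a given time. The 24-hour period, the four rates, and the function names are hypothetical and are not taken from the cited studies.

```python
import math
from typing import Sequence


def cumulative_hazard(tau: float, rates: Sequence[float], period: float) -> float:
    """Integral from 0 to tau of a periodic, piecewise-constant failure rate.

    The period is split into len(rates) equal intervals; rates[i] is the
    failure rate assumed constant during the i-th interval of every period.
    """
    if tau <= 0.0:
        return 0.0
    width = period / len(rates)
    full_periods, remainder = divmod(tau, period)
    total = full_periods * sum(rates) * width
    for rate in rates:
        step = min(width, remainder)
        if step <= 0.0:
            break
        total += rate * step
        remainder -= step
    return total


def failure_probability(tau: float, rates: Sequence[float], period: float) -> float:
    """Probability that the time to failure is smaller than tau."""
    return 1.0 - math.exp(-cumulative_hazard(tau, rates, period))


# Hypothetical daily cycle: M = 4 intervals of 6 hours, rates in failures per hour.
RATES = [0.001, 0.004, 0.006, 0.002]
p_fail_first_day = failure_probability(24.0, RATES, 24.0)
```

When all M rates are equal, the result coincides with the exponential model, which gives a quick sanity check on any set of measured rates.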

2.3.2.4.  Discussion

The popularity of the exponential model arises mainly from its simplicity. The exponential model
may be a useful abstraction to characterize how failures occur. However, the validity of the
exponential model is not supported by the data collected about how errors are detected. A Weibull
distribution with $\alpha_w < 1$ seems a much better choice. On the other hand, for systems in steady state
operation, the periodic model tries to incorporate the fact that the observed unreliability should
depend on patterns of usage, not on a constant set of parameters as the Weibull model
implies. This apparent conflict will be resolved in the present thesis, where it is shown that each of the
above three models is a special case of a more general characterization.


2.4.  Software  Reliability 

The  problem  of  software  reliability  assessment  is  part  of  the  more  general  area  of  software  quality 
assessment  [Mohanly  73].  Effective  mechanisms  for  measuring  software  quality  are  required  due  to 
the high cost of software development and maintenance. Forecasts indicate that by 1985 over 90% of
the total computing dollars spent annually will be for software [Horowitz 75]. The development of
techniques for measuring software reliability has been motivated mainly by project managers who
require models to estimate the manpower needed to develop a software system with a given level of
performance, and measuring techniques to detect when this level of performance has been reached.
However,  most  software  reliability  models  presented  to  date  are  far  from  satisfying  these  two  needs  in 
a  general  context. 

There are basically two approaches to the design of reliable software. The first approach
consists of specifying the desired software behavior as accurately as possible and developing
error-free software according to the specifications. Thus, this approach deals with the development of
fault-intolerant software, and it implicitly assumes that it is possible to develop software packages of
arbitrary size and complexity which are completely error free. The second approach acknowledges
the fact that writing completely error-free software is either impossible or excessively costly. Thus,
fault-tolerant software is written which takes into account the possibility of software faults and
provides mechanisms for recovering from their effects.

Unfortunately,  the  words  "software  reliability  model"  usually  refer  to  mathematical  models  dealing 
with  software  reliability  assessment  during  the  design  of  fault-intolerant  software.  This  is  a  much  more 
restrictive  concept  than  the  general  set  of  tools  used  to  predict,  calibrate,  and  characterize  the 
reliability  of  software  in  a  variety  of  environments  (which  is  what  the  words  "software  reliability 
model"  would  suggest  to  a  novice). 

2.4.1.  Fault-intolerant  software  reliability  assessment 

Software  reliability  models  (in  the  restricted  sense  described  above)  can  be  roughly  grouped  into 
four  categories.  The  first  category  would  include  models  formulated  in  the  time  domain.  These 
models  attempt  to  relate  software  reliability  (characterized,  for  instance,  by  a  MTTF  figure  under 
typical  workload  conditions)  to  the  number  of  bugs  present  in  the  software  at  a  given  time  during  its 
development.  Typical  of  this  approach  are  the  models  presented  in  [Shooman  73],  [Musa  75],  and 
[Jelinsky  73].  Bug  removal  should  increase  MTTF  and  correlation  of  bug  removal  history  with  the  time 
evolution  of  the  MTTF  value  may  allow  the  prediction  of  when  a  given  MTTF  value  will  be  reached.  An 
example  of  the  application  of  time  domain  models  to  the  development  of  a  real-time  system  is  given  in 
[Miyamoto  75].  The  main  disadvantages  of  time  domain  models  are  that  they  do  not  usually  take  into 
account  that  bug  correction  can  generate  more  bugs,  and  that  software  unreliability  can  be  due  not 
only  to  implementation  errors  (bugs)  but  also  to  design  (specification)  errors. 
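
To make the time domain view concrete, the sketch below assumes a Jelinski-Moranda-style relation in which the failure rate is proportional to the number of bugs still present, so that the MTTF is the reciprocal of that rate. The per-bug rate, the bug counts, and the function names are hypothetical. The sketch also shares the limitation noted above: it assumes that every correction removes exactly one bug and introduces none.

```python
def mttf_with_remaining_bugs(remaining_bugs: int, per_bug_rate: float) -> float:
    """MTTF when each remaining bug contributes per_bug_rate to the failure rate."""
    if remaining_bugs <= 0:
        return float("inf")  # no bugs left: this simple model predicts no failures
    return 1.0 / (per_bug_rate * remaining_bugs)


def fixes_needed(initial_bugs: int, per_bug_rate: float, target_mttf: float) -> int:
    """Smallest number of bug removals after which the predicted MTTF meets the target."""
    for fixes in range(initial_bugs + 1):
        if mttf_with_remaining_bugs(initial_bugs - fixes, per_bug_rate) >= target_mttf:
            return fixes
    return initial_bugs  # not reached: the MTTF is infinite once all bugs are removed


# Hypothetical project: 100 initial bugs, each adding 0.001 failures per hour.
initial_mttf = mttf_with_remaining_bugs(100, 0.001)    # roughly 10 hours
fixes_for_45h = fixes_needed(100, 0.001, 45.0)         # removals needed to reach 45 hours
```

Correlating the recorded bug removal history with such a curve is what allows a prediction of when a given MTTF value will be reached.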

Another  approach  to  software  reliability  modeling  is  based  on  studying  the  data  domain.  The  first 
model  of  this  kind  was  described  by  [Nelson  73].  In  principle,  if  sets  of  all  input  data  values  upon 
which  a  computer  program  can  operate  are  identified,  an  estimate  of  the  reliability  of  the  program  can 
be  obtained  by  running  the  program  for  a  subset  of  input  data  values.  A  more  detailed  description  of 
data  domain  techniques  is  given  in  [Thayer  78].  In  the  paper  by  [Schick  78]  the  time  domain  and  data 
domain  models  are  compared.  However,  different  applications  will  tend  to  use  different  subsets  of  all 
possible input data values, "seeing" different reliability values for the same software system. This fact
is formally taken into account in [Cheung 80], where software reliability is estimated from a Markov
model  whose  transition  probabilities  depend  on  a  user  profile.  Techniques  for  evaluating  the 
transition  probabilities  for  a  given  profile  are  given  in  [Cheung  75]. 
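
The dependence of the observed reliability on the user profile can be illustrated with a small sketch in the spirit of the data domain approach. It assumes that the input space has been partitioned into classes, that the program either always fails or always succeeds on a class, and that a profile gives the relative frequency with which each class is used. The class names, the weights, and the function names are hypothetical.

```python
from typing import Callable, Dict


def profile_reliability(profile: Dict[str, float], fails: Callable[[str], bool]) -> float:
    """Probability that a run succeeds when input classes are drawn according to profile.

    profile maps each input class to its relative frequency of use;
    fails(input_class) reports whether the program fails on that class.
    """
    total = sum(profile.values())
    if total <= 0.0:
        raise ValueError("profile must contain positive weights")
    failing = sum(weight for input_class, weight in profile.items() if fails(input_class))
    return 1.0 - failing / total


def fails_on_empty_input(input_class: str) -> bool:
    """Hypothetical test outcome: the program mishandles empty input only."""
    return input_class == "empty"


# Two hypothetical applications exercising the same program differently.
batch_profile = {"small": 0.5, "large": 0.25, "empty": 0.25}
interactive_profile = {"small": 0.75, "large": 0.25, "empty": 0.0}

batch_reliability = profile_reliability(batch_profile, fails_on_empty_input)
interactive_reliability = profile_reliability(interactive_profile, fails_on_empty_input)
```

The same program, with the same fault, appears perfectly reliable to one application and fails a quarter of the time for the other.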

The  third  category  includes  models  in  which  software  reliability  (and  software  quality  in  general)  is 
postulated  to  obey  certain  laws  [Ferdinand  74],  [Fitzsimmons  78].  Although  such  models  have 
generated  large  amounts  of  interest,  their  general  validity  has  never  been  proven  and,  at  most,  they 
only  give  a  figure  for  the  number  of  bugs  present  in  a  program. 

Finally,  there  have  been  some  attempts  to  characterize  total  system  reliability  (hardware  and 
software)  in  [Costes  78],  and  warnings  about  how  not  to  measure  software  reliability  [Littlewood  79]. 

What  all  the  above  models  have  in  common  is  that  none  of  them  characterizes  system  behavior 
accurately  enough  so  as  to  give  the  user  a  figure  of  guaranteed  level  of  performance  under  general 
workload conditions. They concentrate on estimating the number of bugs present in a program but do
not  give  any  accurate  method  to  characterize  and  measure  operational  system  unreliability  due  to 
software.  There  is  a  wide  gap  between  the  variables  that  can  be  easily  measured  in  a  running  system 
and  the  number  of  bugs  in  its  operating  system.  However,  a  cost  effective  analysis  should  allow  the 
evaluation  of  software  unreliability  from  variables  easily  accessible  in  an  operational  system,  without 
knowing  the  details  of  how  the  operating  system  has  been  written. 

2.4.2.  Fault-Tolerant  Software 

Fault-tolerant  software  assures  the  reliability  of  the  system  by  use  of  protective  redundancy  at  the 
software  level.  There  are  two  main  strategies  for  obtaining  fault-tolerant  software: 

•  Recovery  Blocks 

•  N-Version  programming. 

The Recovery Blocks (RV) strategy [Randell 75, Lee 79] consists of three entities: a primary
alternate ($A_1$), an acceptance test (AT), and a list of supplementary alternates ($A_2, \ldots, A_N$). Upon
normal execution, $A_1$ is executed first. If AT is passed, normal computations proceed. If AT is not
passed, a purging of data is performed and a new alternate is called. Some modeling efforts for the
Recovery Blocks strategy have been reported in [Hecht 76].
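
The control flow of a recovery block can be sketched in a few lines. The fragment below is a minimal illustration, assuming that the state is a dictionary, that the purging of data amounts to restoring a checkpoint taken on entry, and that an alternate which raises an exception is treated as having failed the acceptance test. All names and the square-root example are hypothetical.

```python
import copy
import math
from typing import Any, Callable, Dict, Sequence

State = Dict[str, Any]


def run_recovery_block(
    state: State,
    alternates: Sequence[Callable[[State], None]],
    acceptance_test: Callable[[State], bool],
) -> State:
    """Try each alternate on a fresh copy of the checkpointed state until one passes."""
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)  # purge: discard data from earlier attempts
        try:
            alternate(candidate)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(candidate):
            return candidate
    raise RuntimeError("no alternate passed the acceptance test")


def faulty_primary(state: State) -> None:
    state["root"] = state["x"] / 2.0  # deliberately wrong square root


def supplementary(state: State) -> None:
    state["root"] = math.sqrt(state["x"])


def root_is_acceptable(state: State) -> bool:
    return abs(state["root"] ** 2 - state["x"]) < 1e-9


result = run_recovery_block({"x": 9.0}, [faulty_primary, supplementary], root_is_acceptable)
```

The quality of the scheme rests on the acceptance test: an alternate whose wrong result passes the test is indistinguishable from a correct one.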

The N-Version programming (NV) strategy [Avizienis 75, Avizienis 77] requires N > 1 independently
designed  programs  (versions)  for  the  same  function.  The  results  after  each  stage  of  computation  are 
compared  and  in  the  case  of  disagreement,  a  preferred  result  is  identified.  If  redundant  hardware  is 
available,  the  N  versions  can  be  executed  concurrently.  Otherwise,  a  performance  penalty  is  paid 
since  the  N  versions  have  to  be  executed  serially  on  the  same  hardware.  In  [Grnarov  80]  the 
processing  times  and  reliability  performance  of  the  RV  and  NV  strategies  are  compared. 
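
A minimal sketch of the N-Version strategy follows. It assumes serial execution on the same hardware and takes a strict majority vote as the rule for identifying the preferred result, which is only one of several possible decision rules. The three versions shown, including the deliberately faulty one, are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def run_n_versions(versions: Sequence[Callable[[int], Hashable]], argument: int) -> Hashable:
    """Run every version on the same input and return the strict-majority result."""
    results = [version(argument) for version in versions]  # serial execution
    value, votes = Counter(results).most_common(1)[0]
    if 2 * votes <= len(results):
        raise RuntimeError("versions disagree and no majority exists")
    return value


def version_a(n: int) -> int:
    return abs(n)


def version_b(n: int) -> int:
    return n if n >= 0 else -n


def version_c(n: int) -> int:
    return n  # faulty: forgets to handle negative inputs


preferred = run_n_versions([version_a, version_b, version_c], -4)
```

Here the faulty version is outvoted, but the scheme offers no protection when a majority of versions share the same design error.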

Because  of  development  costs,  fault-tolerant  software  can  be  found  in  only  a  few  systems  with 
exceptional  reliability  requirements,  such  as  space  or  military  systems.  Thus  fault-tolerant  software 
will  not  be  modeled  here  since  it  is  not  available  in  the  majority  of  commercial  systems. 

2.4.3.  Discussion 

In  [Glass  81]  a  study  about  "persistent"  software  errors  is  summarized.  A  software  error  is  defined 
to  be  persistent  if  it  eludes  early  detection  efforts  and  does  not  surface  until  the  software  is 
operational.  One  of  the  findings  of  this  study  is  that  a  large  percentage  of  persistent  software  errors 
are  instances  of  the  software  not  being  sufficiently  complex  to  match  the  problem  being  solved.  It 
seems  as  if  the  programmers  were  straining  to  comprehend  the  complex  interrelationships  of  a 
problem  solution  and  failed.  The  analysis  section  of  a  Software  Problem  Report  presented  by  [Glass 
81]  as  a  typical  example  literally  describes  the  cause  of  a  bug  as  "insufficient  brain  power  applied 
during  design".  A  large  number  of  errors  are  the  result  of  a  predicate  not  having  enough  conditions, 
or  of  a  variable  not  being  reset  to  some  value  after  a  major  piece  of  code  has  finished  dealing  with  it. 

Unreliability  due  to  software  in  operational  systems  is  therefore  mainly  due  to  persistent  errors. 
That  is,  the  complexity  of  the  data  to  be  processed  has  been  oversimplified  in  some  situations.  When 
one of these situations arises, a software error is generated. Since the software does not change once
it has been written, one would be tempted to view the software and all its attributes as static
entities.  However,  this  is  not  what  is  observed  in  most  operational  systems.  Although  the  software  is 
static,  the  complexity  of  the  data  to  be  processed  changes  dynamically  according  to  workload  and 
use  of  the  system.  Therefore,  the  view  of  software  reliability  as  a  static  property  may  be  useful  for 
software  designers,  but  it  is  certainly  inadequate  for  users  wishing  to  evaluate  the  impact  of  software 
unreliability  in  a  variety  of  working  environments.  The  observed  software  unreliability  in  an 
operational  system  is  a  dynamic  attribute  depending  (at  least)  on  the  following  two  factors: 

•  How  much  the  software  is  used  (number  of  executions  per  unit  time) 

•  In what way the software is used (the complexity of the data to be processed)


These  two  points  will  be  elaborated  in  Section  2.5.2. 


2.5.  Model  verification  requirements 


The  previous  sections  have  summarized  some  current  problems  associated  with  reliability 
characterization of digital computing systems. As stated in Chapter 1, the main goal of this thesis is to
provide a modeling methodology capable of providing better reliability characterization, particularly
relating  to  the  effects  of  hardware  transients  and  software  faults  in  operational  systems.  The  following 
constraints  have  also  been  imposed: 

•  The  characterization  should  be  user  oriented.  That  is,  it  should  provide  users  with  a  set  of 
tools  to  evaluate  the  impact  of  unreliability  due  to  software  and  hardware  transients.  This 
contrasts  sharply  with  most  software  reliability  models,  which  are  oriented  to  help  the 
designer  to  meet  a  static  requirement. 

•  No  matter  how  complex  the  model  may  be,  the  results  should  be  easy  to  understand  and 
apply. 

•  Model  parametrization  must  be  possible  from  easily  measurable  variables  in  operational 
systems.  Situations  such  as  the  ones  created  after  the  introduction  of  "coverage" 
(conceptually  attractive,  but  impractical  to  measure  in  real  systems)  should  be  avoided. 

•  The  model  must  be  validated  by  contrasting  its  predictions  with  the  behavior  of  real 
systems. 


The  last  restriction  is  particularly  important  since  validation  will  be  possible  only  with  systems  which 
have  the  necessary  measuring  tools  already  incorporated.  Because  of  its  availability  at  CMU, 
validation will be made with the TOPS-10 Operating System, a MULTICS-like Time Sharing system.
Since  most  commercially  available  Time-Sharing  systems  have  protection  mechanisms  based  on  the 
original  MULTICS  design,  this  is  not  a  particularly  restrictive  constraint.  However,  MULTICS 
protection  mechanisms  themselves  may  give  some  hints  about  how  the  analysis  should  be  started. 

2.5.1.  The MULTICS Time Sharing system

MULTICS  (MULTiplexed  Information  and  Computing  Service)  [Organick  72]  was  designed  in  the 
mid  1960’s  as  a  prototype  of  a  computer  utility.  Among  other  goals,  it  was  to  provide  convenient 
remote  terminal  access,  continuous  operation  analogous  to  that  of  electric  power  or  telephone 
companies,  and  the  ability  to  support  different  programming  environments.  Thus,  one  of  the 
requirements  was  to  provide  facilities  for  the  protection  of  concurrently  executing  programs.  The 
protection  mechanism  proposed  for  MULTICS  (and  originally  implemented  in  software)  was  named 
rings  of  protection.  Conceptually,  an  executing  program  segment  in  MULTICS  is  executing  in  one  of  a 
set of concentric rings. A program can access programs and data in the rings outside its own ring, but
data in inner rings can be accessed only through predefined "gates". By subsetting the segments of a
process  into  rings  and  by  effectively  controlling  interactions  and  communication  between  segments 
in  different  rings,  MULTICS  provides  the  potential  to  isolate  trouble  and  limit  damage.  Different  rings 
are  equated  to  different  levels  of  damage. 


Figure 2-1: Rings of protection in a typical Time Sharing system

Later  systems  have  typically  four  rings  of  protection  (Figure  2-1)  and  have  the  necessary  hardware 
mechanisms  to  enforce  protection  across  them.  The  innermost  ring  or  kernel  is  the  most  privileged 
and  the  closest  to  the  hardware.  I/O  interrupt  routines,  schedulers,  pagers,  and  the  most  critical 
operating  system  data  structures  reside  in  the  kernel.  The  outer  rings  have  different  levels  of  privilege 
and responsibility. In a typical partition, I/O formatting and operating system services are executed in
the next ring or executive, command parsing and real-time jobs execute in the following ring or
supervisor, and the last ring (the least privileged) is reserved for the execution of user processes and
run-time libraries.
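
The access rule implied by the rings can be written down as a small predicate. The sketch below is only an illustration of the concept, not of the MULTICS hardware or software: it numbers the four rings of the typical partition from the innermost outward, and it assumes that access within the same ring is always permitted. The names `Ring` and `may_access` are hypothetical.

```python
from enum import IntEnum


class Ring(IntEnum):
    """Four-ring partition; a smaller number means a more privileged ring."""
    KERNEL = 0
    EXECUTIVE = 1
    SUPERVISOR = 2
    USER = 3


def may_access(caller: Ring, target: Ring, via_gate: bool = False) -> bool:
    """Return True if code running in caller may reach programs or data in target."""
    if target >= caller:
        return True      # same ring (assumed) or an outer ring: direct access
    return via_gate      # an inner ring is reachable only through a predefined gate


user_reads_kernel_directly = may_access(Ring.USER, Ring.KERNEL)
user_calls_kernel_gate = may_access(Ring.USER, Ring.KERNEL, via_gate=True)
kernel_reads_user = may_access(Ring.KERNEL, Ring.USER)
```

Because every inward crossing goes through a gate, damage originating in an outer ring has a limited number of paths by which it can reach the kernel.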

2.5.2.  A starting point

One  of  the  nice  properties  of  having  rings  of  protection  is  that  software  singularities  are  readily 
identified.  Assuming  perfect  recovery  from  faults  in  the  outer  rings,  the  entire  system  collapses  only 
when  the  kernel  software  cannot  execute.  But  the  kernel  software  will  always  execute  properly  unless 

•  Hardware  transient  errors  corrupt  the  kernel  code  or  data  structures 

•  The  kernel  software  itself  contains  faults  that  under  certain  conditions  corrupt  itself  or  its 
data  structures 

•  Some hardware is unavailable because of a permanent hardware fault

The  observed  reliability  due  to  hardware  transients  and  software  faults  will  therefore  depend  on 
how much the kernel is used. Given that transients occur at random, the longer the system executes
in kernel mode, the more likely it is that a transient will affect the kernel software or its data structures.
Analogously,  the  probability  that  a  software  fault  will  manifest  itself  as  an  error  will  increase  as  the 
kernel  is  exercised  more  and  more.  Thus,  the  observed  unreliability  due  to  hardware  transient  faults 
and  software  faults  should  be  a  function  of  kernel  usage. 

The  assumption  of  having  perfect  recovery  from  faults  in  the  outer  rings  is  too  strong  and  can  be 
partially  relaxed.  In  fact,  the  two  main  assumptions  on  which  the  thesis  is  based  are: 

•  Software  faults  in  the  kernel  are  more  likely  to  lead  to  a  system  failure  than  software  faults 
in  any  of  the  outer  rings 

•  A transient affecting the operation of the kernel software is more likely to lead to a system
failure than a transient affecting other software.

These  two  assumptions  are  compatible  with  the  presence  of  software  faults  in  outer  rings  which  may 
abort  single  jobs,  or  even  occasionally  crash  the  system.  The  assumptions  refer  to  the  average 
behavior  of  a  system  in  steady  state  operation  and  do  not  negate  the  possibility  of  pathological 
situations  (such  as  the  possibility  of  having  a  software  fault  that  crashes  the  system  in  an  almost  zero 
load  situation).  These  two  assumptions  only  suppose  that  such  pathological  cases  are  rare. 

In Chapters 3 to 7, the consequences of the above two assumptions will be rigorously formulated and
validated, and their main implications will be investigated. At the end of Chapter 7, the
modeling  methods  derived  from  these  two  assumptions  will  be  combined  with  traditional  modeling 
tools  and  the  possibility  of  permanent  hardware  failures  will  be  also  taken  into  account.  Thus,  a 
combined  model  taking  into  account  the  effects  of  transient  hardware  failures,  software  failures,  and 
permanent  hardware  faults  will  be  introduced. 

2.6.  Summary

This  chapter  has  summarized  some  of  the  problems  associated  with  the  characterization  of 
computing  systems  reliability.  In  particular,  it  has  been  shown  how  independent  reliability  evaluation 
and  performance  evaluation  is  itself  a  problem.  The  main  causes  of  system  unreliability  (hardware 
transients  and  software  faults)  have  also  been  described,  along  with  current  modeling  efforts. 

Modeling  methodologies  for  hardware  transient  faults  and  software  faults  are  completely 
independent, probably because these modeling methodologies are a response to designers' needs,
and  component  designers  rarely  interact  with  software  designers. 

The  approach  adopted  in  this  thesis  to  characterize  reliability  at  the  system  level  is  to  put  more 
emphasis  on  what  will  be  observed  while  paying  less  attention  to  how  and  why  a  given  error  occurs. 
The  main  implicit  assumption  throughout  the  thesis  is  that  reliability  of  complex  systems  can  be 
characterized  by  examining  the  patterns  of  usage  of  system  singularities.  The  more  a  singularity  is 
used,  the  more  likely  it  is  that  a  failure  will  be  observed. 

For  (ideal)  Time  Sharing  systems,  the  main  singularity  is  the  kernel  of  the  operating  system.  The 
kernel  can  be  damaged  either  because  of  transients  or  because  of  kernel  software  errors.  The  more 
the kernel is exercised, the more likely it is that a transient will affect its operation or that a software fault
will  generate  an  error.  The  formal  framework  for  this  approach  is  presented  in  the  following  chapter. 

Chapter 3

Mathematical  formulation 

This  chapter  gives  the  mathematical  basis  of  a  model  capable  of  predicting  the  unreliability  of 
digital  computers  due  to  hardware  transients  and  software  faults.  The  results  are  essentially 
theoretical and will be validated by analyzing the behavior of real systems in subsequent chapters.
Although the main goal is to develop a mathematical framework suitable for the characterization of the
reliability of MULTICS-like Time Sharing systems, the results obtained in this chapter are expected to
apply  to  a  wider  class  of  complex  systems,  namely,  those  systems  with  a  failure  rate  that  can  be 
approximated  by  either  a  stationary  or  cyclostationary  Gaussian  process.  All  the  approximations  and 
specializations  to  computing  systems  analysis  will  be  worked  out  in  Chapter  4.  The  results  presented 
in  this  Chapter  are  closer  to  applied  probability  theory  than  to  computing  systems  characterization. 
For the reader not interested in strictly mathematical results, the introduction to Section 3.2, Section
3.2.1, and the summary at the end of the chapter should be enough to give an idea of the main results.

In Section 3.1 the necessary definitions are given and the notation used throughout the thesis is
introduced.  Section  3.2  is  devoted  to  the  description  of  the  process  underlying  the  unreliable 
behavior  of  digital  computer  systems.  The  emphasis  is  not  on  why  and  how  often  faults  are  generated, 
but  on  what  the  system  is  doing  when  an  error  is  detected.  The  reliability  of  the  system  is  shown  to 
depend  on  an  integral  converging  to  a  Gaussian  random  variable  and,  more  generally,  to  a  Wiener 
process.  However,  its  evaluation  requires  some  statistics  which  are  impractical,  if  not  impossible,  to 
evaluate  from  real  systems.  Thus,  in  Section  3.3  an  approximation  is  given  which  depends  only  on 
easily  measurable  variables.  The  Probability  Distribution  Function  of  the  time  to  failure  is  shown  to  be 
completely  characterized  by  a  single  time  function  in  Section  3.4,  leading  to  a  conceptually  equivalent 
but  simpler  description  of  the  failure  process  than  the  description  based  on  convergence  concepts. 
Finally,  a  summary  of  the  results  presented  in  this  Chapter  is  given  in  Section  3.7. 

3.1.  Definitions and notation

A Probability Space $(\Omega, \mathcal{A}, \mathcal{P})$ is comprised of a sample space $\Omega$, a collection of subsets of $\Omega$ forming
a sigma field (written $\sigma$-field) of events, $\mathcal{A}$, and a probability measure $\mathcal{P}$. An element $\omega \in \Omega$ is a possible
outcome of a random experiment. A subset of $\Omega$ (a collection of possible outcomes) is called an event.
In general, not all collections of outcomes are observable events. Probability theory deals only with
events in a $\sigma$-field. A $\sigma$-field of sets is a collection of sets closed under complementation, union, and
countable unions. The reason for associating observable events with a $\sigma$-field is that whenever we
have a sequence of observable events $\{A_i\}$, the fact that their complements $\{A_i^c\}$ and countable
union $\bigcup_{i=1}^{\infty} A_i$ are also observable events facilitates the proofs of many basic probability theory
results. Finally, the probability measure $\mathcal{P}$ is a function that maps each set in $\mathcal{A}$ into the unit interval
[0,1].

Definition 1: A random variable $x(\omega)$ is a function with domain $\Omega$ and range the real line
$\mathbb{R}$ such that for every Borel set X in $\mathbb{R}$, the set $\{\omega \mid x(\omega) \in X\}$ is in the $\sigma$-field of events $\mathcal{A}$.

The definition ensures that the probability of any event of the form $P(\{\omega \mid x(\omega) \in X\})$ is well defined for
any set X in $\mathcal{B}$, where the Borel sets $\mathcal{B}$ are the subsets of $\mathbb{R}$ belonging to the smallest $\sigma$-field
generated by the set of all closed intervals. Sometimes it will be necessary to refer to all possible
events associated with a random variable or with a collection of random variables $x_1, \ldots, x_k$. Such a
collection of events will be a $\sigma$-field and will be denoted by $\sigma(x_1, \ldots, x_k)$, meaning the smallest $\sigma$-field
containing all the sets of the form

$$\{\omega \mid x_1(\omega) \in X_1, \ldots, x_k(\omega) \in X_k\}, \qquad X_1, \ldots, X_k \in \mathcal{B}$$

Definition 2: The Probability Distribution Function (PDF) of a random variable x is
defined as the function

$$P_x(x < \xi) = P(\{\omega \mid x(\omega) < \xi\})$$

The PDF of a random variable maps the real line $\mathbb{R}$ into the unit interval. It is a nondecreasing
function of $\xi$ and $P_x(x < -\infty) = 0$, $P_x(x < \infty) = 1$. If there exists a nonnegative function $p_x(u)$ such that

$$P_x(x < \xi) = \int_{-\infty}^{\xi} p_x(u)\, du$$

then $p_x$ is said to be the probability density function (pdf) of x.

Of particular importance is the Gaussian or Normal distribution. If a random variable x is Gaussian
(or normally) distributed, then

$$P(x < \alpha) = \frac{1}{(2\pi)^{1/2}\,\sigma} \int_{-\infty}^{\alpha} e^{-\frac{(u-m)^2}{2\sigma^2}}\, du \tag{3.1}$$

where m and $\sigma^2$ are respectively the expected value and variance of x. The normalized Gaussian
distribution (with zero mean and variance 1) will be noted $\Phi(\alpha)$

$$\Phi(\alpha) = \frac{1}{(2\pi)^{1/2}} \int_{-\infty}^{\alpha} e^{-u^2/2}\, du \tag{3.2}$$

Definition 3: A stochastic process $\{x_t(\omega);\ t \in T\}$ is a family of random variables all
defined in the same probability space and indexed by a real parameter t that takes
values in a parameter set T called the index set of the process.

The indexing parameter t will represent time in all the processes presented in this thesis and T will
always be equal to the real line $\mathbb{R}$, that is, only continuous time processes will be considered. For each
fixed $t \in \mathbb{R}$, $x_t(\omega)$ as a function of $\omega$ will be a real valued random variable. For each $\omega \in \Omega$, $x_t(\omega)$ as a
function of t will be a real valued function of time called a realization or sample function of the
process. The set of all these time functions is called the ensemble of the process. A sequence (or a
countable stochastic process) of random variables $x_1(\omega), x_2(\omega), \ldots$ is a particular form of a stochastic
process in which the index set is the nonnegative integers $\mathbb{N}^+$.

Stochastic processes will always be denoted such that time dependency will be expressed as a
subscript, while deterministic functions of time will have the argument in parenthesis. Thus, $x_t$ is a
stochastic process and h(t) is a deterministic function of t.

The following convergence concepts will be needed later in the chapter.

Definition 4: Convergence in probability. The sequence $\{x_n(\omega)\}$ is said to converge in probability to $x(\omega)$ if for every $\varepsilon > 0$

$$ \lim_{n\to\infty} P(|x_n - x| > \varepsilon) = 0 $$

and will be noted as

$$ \operatorname{plim}_{n\to\infty} x_n = x $$

Definition 5: Convergence in distribution. The sequence $x_1, x_2, \dots$ is said to converge in distribution to x if

$$ \lim_{n\to\infty} P_{x_n}(x_n \le \xi) = P_x(x \le \xi) $$

and will be noted as

$$ P_{x_n}(\xi) \Rightarrow P_x(\xi) $$

Convergence in distribution is the weaker of the two concepts, since it is implied by convergence in probability.

Definition 6: A counting process $\{N_t(\omega);\ t \ge t_0\}$ is a stochastic process having the set $\mathbb{N}^+ = \{0, 1, 2, \dots, \infty\}$ of nonnegative integers as its state space.

For each $\omega \in \Omega$, $N_t(\omega)$ is a piecewise-constant function of t with jumps at $t_1(\omega), t_2(\omega), \dots, t_n(\omega)$, the values of $t_1, \dots, t_n$ depending on the realization of the process. Counting processes are always associated with point processes, the value of $N_t(\omega)$ for $t_i \le t < t_{i+1}$ being the total number of "points" generated up to t. All counting processes presented in this thesis will be associated with failure processes of a given system, the value of $N_t(\omega)$ for $t_i \le t < t_{i+1}$ being the number of failures detected up to t.

Definition  7:  A  renewal  process  is  a  counting  process  where  the  time  durations 
between  consecutive  events  are  positive,  independent,  identically  distributed  random 
variables. 

Renewal  processes  are  commonly  used  for  reliability  modeling.  In  the  case  of  permanent  hardware 
faults,  it  is  assumed  that  after  repair  has  been  done  the  system  is  as  good  as  new.  Thus,  the  times 
between  successive  permanent  hardware  faults  verify  the  conditions  given  in  the  above  definition  and 
the  failure  process  is  usually  assumed  to  be  a  renewal  process. 

Definition 8: A Poisson process is a counting process $\{N_t;\ t \ge t_0\}$ with the following three properties:

1. $\Pr[N_{t_0} = 0] = 1$.

2. For $t_0 \le s < t$, the increment $N_{s,t} = N_t - N_s$ is Poisson distributed with parameter $\Lambda_t - \Lambda_s$, where $\Lambda_t$ is a nonnegative, nondecreasing function of t.

3. $\{N_t;\ t \ge t_0\}$ has independent increments.

Property 3 is the distinguishing property. It means that for a Poisson counting process, the numbers of points in nonoverlapping intervals are statistically independent random variables, no matter how large or small the intervals are and no matter how distant or close they may be. The function $\Lambda_t$ in property 2 is termed the parameter function of the process. If $\Lambda_t$ is an absolutely continuous function of t, it can be expressed as

$$ \Lambda_t = \int_{t_0}^{t} \lambda_\tau\, d\tau \tag{3.3} $$

where $\lambda_\tau$ is a nonnegative function of $\tau$ for $\tau \ge t_0$. The function $\lambda_t$ is termed the intensity function of the process $N_t$. At any time $t \ge t_0$, the intensity function $\lambda_t$ is the instantaneous average rate at which points occur. If $N_t$ is a failure process, $\lambda_t$ is the failure rate of the process.

Definition 9: A Poisson process is said to be homogeneous when the intensity function $\lambda_t$ is a constant $\lambda$ independent of time.

For a homogeneous Poisson process, the PDF of the time to the next failure $t_f$, given that the system is observed since time $t_s$, is given by the Exponential distribution

$$ P(t_f \le \tau \mid t_s) = 1 - e^{-\lambda(\tau - t_s)} \tag{3.4} $$

where $\lambda$ is the mean rate at which points (failures) are generated.

Definition 10: Whenever the intensity function $\lambda_t$ is not a constant but a deterministic function of time $\lambda(t)$, the corresponding Poisson process is said to be nonhomogeneous.

For a nonhomogeneous Poisson process, the PDF of the time to the first failure is given by

$$ P(t_f \le \tau \mid t_0) = 1 - e^{-\int_{t_0}^{\tau} h(t)\, dt} \tag{3.5} $$

where $h(t)$ is termed the hazard function of the process. Note that by property 2 in the definition of a Poisson process, $h(t)\Delta t$ is the probability of observing a failure in the infinitesimal interval $[t, t + \Delta t)$. Thus, for a nonhomogeneous Poisson process, the probability of observing a failure in different infinitesimal intervals evolves as a deterministic function of time.
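As an illustration of (3.5), the following minimal Python sketch evaluates the distribution of the time to the first failure by numerically integrating a hazard function. The function names, the trapezoidal rule, and the daily-periodic hazard `periodic_hazard` are assumptions made for this example only; they do not come from the text.

```python
import math


def cumulative_hazard(h, t0, tau, steps=10_000):
    """Trapezoidal approximation of the integral of h(t) from t0 to tau."""
    if tau <= t0:
        return 0.0
    dt = (tau - t0) / steps
    total = 0.5 * (h(t0) + h(tau))
    for i in range(1, steps):
        total += h(t0 + i * dt)
    return total * dt


def first_failure_cdf(h, t0, tau, steps=10_000):
    """P(t_f <= tau | t0) = 1 - exp(-integral of h), following (3.5)."""
    return 1.0 - math.exp(-cumulative_hazard(h, t0, tau, steps))


def periodic_hazard(t):
    """Hypothetical hazard (failures per hour) with a 24-hour period."""
    return 0.01 * (1.0 + 0.5 * math.sin(2.0 * math.pi * t / 24.0))


if __name__ == "__main__":
    # Probability of at least one failure during the first day of operation.
    print(first_failure_cdf(periodic_hazard, 0.0, 24.0))
```

With a constant hazard the same routine reduces to the Exponential distribution (3.4), which gives a simple check of the numerical integration.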

Definition 11: Let $x_t$ be a stochastic process that is an "outside" process influencing the evolution of a counting process $\{N_t;\ t \ge t_0\}$. $N_t$ is a doubly stochastic Poisson process with intensity process $\{\lambda_t(x_t);\ t \ge t_0\}$ if for almost every realization of the process $x_t$, $N_t$ is a Poisson process with intensity function $\lambda_t(x_t)$.

The process $x_t$ carries the information about how the intensity process varies, and for this reason is sometimes called the information process.

Definition 12: A stationary process (in the strict sense) is a stochastic process $\{x_t\}$ with the property that for any positive integer k and any points $t_1, \dots, t_k$ and h in T, the joint distribution of

$$ \{x_{t_1}, \dots, x_{t_k}\} $$

is the same as the joint distribution of

$$ \{x_{t_1+h}, \dots, x_{t_k+h}\} $$

Intuitively, a process is stationary if it has the same joint statistics regardless of where the time origin is set. Hence, if $x_t$ is a stationary Gaussian process, the joint distribution function of $\{x_{t_1+h}, \dots, x_{t_k+h}\}$ is a multivariate Gaussian distribution whose covariance matrix is independent of h.

Definition 13: The Autocorrelation function $R_{xx}(t_1, t_2)$ of a process $x_t$ is defined as

$$ R_{xx}(t_1, t_2) = E[x_{t_1} x_{t_2}] $$

where $E[\cdot]$ stands for expected value.

If $x_t$ is stationary and real, $R_{xx}(t_1, t_2)$ depends only on the time difference $\tau = |t_1 - t_2|$ and

$$ R_{xx}(\tau) = E[x_{t+\tau}\, x_t] $$

Definition 14: A stationary Gaussian process will be termed white noise if its Autocorrelation function is given by

$$ R_{xx}(\tau) = \delta(\tau) \tag{3.6} $$

where $\delta(\tau)$ is the Dirac delta function.


As  will  be  discussed  in  the  following  sections,  the  main  difference  between  white  noise  and  any 
other  stationary  Gaussian  process  is  that  of  predictability.  While  a  maximum  likelihood  estimate  of 
future  values  of  a  process  exists  for  nonwhite  noise  processes,  white  noise  is  essentially 
unpredictable. 

Definition 15: A stochastic process $x(t, \omega)$ is ergodic in the most general sense if all of its statistics can be determined from a single realization $x(t, \omega_0)$ of the process.


Loosely  speaking,  a  process  is  ergodic  if  time  averages  (the  only  ones  that  can  be  obtained  from  a 
single  realization  of  the  process)  equal  ensemble  averages  (i.e.  expected  values).  Obviously, 
ergodicity  can  be  defined  with  respect  to  certain  parameters  of  the  process.  Only  ergodicity  with 
respect  to  the  autocorrelation  function  will  be  needed  in  this  thesis,  which  is  defined  as  follows : 


Definition 16: A stochastic process is ergodic with respect to the autocorrelation function if

$$ R_{xx}(\tau) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} x_{t+\tau}\, x_t\, dt \tag{3.7} $$

with probability one.
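The time average in (3.7) can be sketched in discrete time as follows. This is a minimal illustration under the assumption that a single realization has been sampled at equally spaced instants and stored in a list; the function name and the sampling convention are choices made for this example and are not part of the text.

```python
def time_average_autocorrelation(x, lag):
    """Estimate R_xx(lag) from one sampled realization by a time average.

    x   : list of equally spaced samples of a single realization
    lag : nonnegative integer number of sampling intervals
    """
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    terms = n - lag  # number of products x[i + lag] * x[i] available
    return sum(x[i + lag] * x[i] for i in range(terms)) / terms


if __name__ == "__main__":
    record = [1.0, -1.0] * 50  # toy record that alternates in sign
    print([time_average_autocorrelation(record, k) for k in range(3)])
```

For a finite record the estimate is only meaningful for lags that are small compared with the record length, since few products are averaged at large lags.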


If ergodicity of the autocorrelation function is satisfied, the autocorrelation function can be estimated by computing the above integral for a finite record of a single realization of the process $x_t$.

Definition 17: A real valued, continuous time stochastic process is defined to be a cyclostationary process with period T if and only if [Gardner 75]

1. $E\{x_t\} = E\{x_{t+T}\}$

2. $E\{x_t x_s\} = E\{x_{t+T}\, x_{s+T}\} \quad \forall\, s, t$

that is, it is a stochastic process with periodic mean and autocorrelation functions.

Definition  18:  A  doubly  stochastic  Poisson  process  will  be  said  to  be  a  cyclostationary 
Poisson  process  if  its  information  process  is  cyclostationary. 

Definition 19: A Wiener process is a stochastic process $\{W_t;\ t \ge t_0\}$ such that $W_{t_0} = 0$ and the joint distribution of

$$ (W_{t_0}, W_{t_1}, \dots, W_{t_n}), \qquad t_n > t_{n-1} > \dots > t_0 \ge 0 $$

is specified by the requirement that the random variables $x_k = W_{t_k} - W_{t_{k-1}}$, $k = 1, \dots, n$, be independent, normally distributed random variables with

$$ E[W_{t_k} - W_{t_{k-1}}] = \mu\,(t_k - t_{k-1}) $$

$$ Var[W_{t_k} - W_{t_{k-1}}] = \sigma^2 (t_k - t_{k-1}) $$

In particular, note that for fixed t, $W_t$ is a normally distributed random variable with $E[W_t] = \mu t$ and $Var[W_t] = \sigma^2 t$. The Wiener process is an interesting abstraction useful in describing certain physical phenomena such as the Brownian motion of a particle in a fluid. It has curious mathematical properties, such as the fact that although almost all sample functions are continuous, they are nowhere differentiable. However, despite this lack of differentiability, if $w_t$ is white noise,

$$ W_t = \int_{t_0}^{t} w_\tau\, d\tau \tag{3.8} $$

in the sense that the integral on the right side of the above equation has all the formal attributes of a Wiener process.
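A discrete-time sketch of the relation between white noise and the Wiener process may help fix ideas. The code below is an illustration only: it treats the standard case with zero mean and variance equal to elapsed time, approximates the integral of white noise by a running sum of independent Gaussian increments of variance dt, and uses function names and a seeded generator that are assumptions of this example.

```python
import math
import random


def wiener_path(t_end, steps, rng):
    """Approximate one sample path on [0, t_end] by summing Gaussian increments."""
    dt = t_end / steps
    path = [0.0]  # the process starts at zero
    for _ in range(steps):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


def sample_variance_at(t_end, steps, paths, seed=0):
    """Sample variance of W at time t_end across independent simulated paths."""
    rng = random.Random(seed)
    finals = [wiener_path(t_end, steps, rng)[-1] for _ in range(paths)]
    mean = sum(finals) / paths
    return sum((v - mean) ** 2 for v in finals) / (paths - 1)


if __name__ == "__main__":
    # The sample variance should be close to the elapsed time t_end.
    print(sample_variance_at(2.0, 50, 4000))
```

The simulated paths are piecewise constant between grid points, so the sketch reproduces the Gaussian increments of the process but not the fine structure of its sample functions.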


Let $C$ denote the space of all real valued, continuous functions on $[0, \infty)$ and let $\mathcal{C}$ denote the smallest $\sigma$-field of $C$ where $W_t$ is measurable. It can be shown that there exists a unique probability measure $\mathcal{W}$ such that $\{W_t,\ t_0 \le t < \infty\}$ is a Wiener process. By definition, $\mathcal{W}$ is such that $W_t$ is a Gaussian distributed random variable, and $\mathcal{W}(t, \alpha)$ will be used to note

$$ \mathcal{W}(t, \alpha) = P(W_t \le \alpha) = \frac{1}{(2\pi t)^{1/2}} \int_{-\infty}^{\alpha} e^{-x^2/2t}\, dx \tag{3.9} $$

3.2. The underlying failure process

The mathematical problem to be solved is summarized in Figure 3-1. As explained in Chapter 2, the problem is to characterize the unreliability of a MULTICS-like Time Sharing System due to hardware transient faults and software faults.


Figure 3-1 shows a situation in which hardware transient faults occur in different components of different subsystems (memory, CPU, bus) at times $t_1, t_2, \dots$. The sensitivity of the system to the presence of a hardware transient fault depends on what the system is doing at the moment that the transient occurs. If a transient occurs while the system operates in kernel mode, the system will crash with probability $p_k$. If the system operates in user mode at the moment that the transient occurs, the probability of a crash is $p_u$. It is assumed that $p_k > p_u$. The system may also crash while in kernel mode due to the manifestation of a kernel software fault. The probability of such an event during a time interval $\Delta t$ is assumed to be $p_s \Delta t$ (the assumption that $p_s$ is constant will be relaxed in the following Chapter).


The probability of observing an error in a single component is extremely small, and the number of components, very large. The average Mean Time To Failure (MTTF) of a single component is on the order of $10^6$ hours ($\sim 10^3$ years) for hard failures [Hodges 77]. The number of components varies from $10^3$ for a small minicomputer like the PDP-11/40 [Bell 78] to $10^5$ IC packages for a supercomputer like the CRAY-1 [Russell 78]. Hence the failure process due to transients is equal to the superposition of a large number of very sparse failure processes. It is proved in [Cinlar 72] that this type of superposition converges to a Poisson process. Thus, the system failure process can be viewed as a Poisson process with intensity

$$ \lambda_t = \begin{cases} p_k + p_s & \text{if the system operates in kernel mode at time } t \\ p_u & \text{otherwise} \end{cases} \tag{3.10} $$

$\lambda_t$ will be termed the failure rate because it is the rate at which errors leading to a system failure are generated.


3.2.1. The underlying intensity process

Let $N_{[t_1,t_2]}$ be the counting process which counts the number of system failures in the interval $[t_1, t_2]$. Whether the system operates in kernel or user mode depends on user requests for program execution and on program behavior. But it is certain that requests to the kernel will arrive at random and that the duration needed by the kernel to satisfy each request will also be random. $\lambda_t$ is therefore a stochastic process and $N_{[t_1,t_2]}$ becomes a doubly stochastic Poisson process.

Figure 3-1: Typical sequence of events relevant to the characterization of the reliability of a Time Sharing computer system. Hardware transients have a different probability of leading to a system failure if the system operates in kernel mode than if the system operates in user mode. Kernel software faults can only lead to system failures while parts of the kernel are executed.

Let $R_T$ be the number of requests received by the kernel in a time interval T. A common assumption made in queuing theory is that $R_T$ is a Poisson distributed random variable,

$$ P(R_T = n) = \frac{(\nu T)^n}{n!}\, e^{-\nu T} \tag{3.11} $$

where $P(R_T = n)$ is the probability of receiving n requests in an interval T and $\nu$ is the average number of requests received per unit time.


Operational policies and human behavior guarantee that in most Time Sharing systems $\nu$ is not going to be a constant but a time varying function reflecting the workload of the system at each time. Thus, more generally, the probability of observing n requests to the kernel in the interval $[t_1, t_2]$ becomes

$$ P(R_{[t_1,t_2]} = n) = \frac{\left( \int_{t_1}^{t_2} \nu(\tau)\, d\tau \right)^n}{n!}\; e^{-\int_{t_1}^{t_2} \nu(\tau)\, d\tau} \tag{3.12} $$

where $\nu(t)$ is the instantaneous average number of requests to the kernel per unit time and

$$ E[R_{[t_1,t_2]}] = \int_{t_1}^{t_2} \nu(\tau)\, d\tau \tag{3.13} $$

is the expected number of requests in the interval $[t_1, t_2]$.
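The request model of (3.12) and (3.13) can be sketched numerically as follows. The midpoint integration rule, the function names, and the daily workload profile `daily_workload` are hypothetical choices made for illustration; the text does not prescribe any particular form for the request rate.

```python
import math


def expected_requests(nu, t1, t2, steps=10_000):
    """Midpoint-rule approximation of the integral of nu over [t1, t2], as in (3.13)."""
    dt = (t2 - t1) / steps
    return sum(nu(t1 + (i + 0.5) * dt) for i in range(steps)) * dt


def request_count_pmf(n, mean):
    """Poisson probability of exactly n requests given the expected count, as in (3.12)."""
    if mean <= 0.0:
        return 1.0 if n == 0 else 0.0
    # Work in logarithms to avoid overflow of mean**n and n! for large n.
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def daily_workload(t):
    """Hypothetical request rate (requests per second) peaking once per 24 hours."""
    return 50.0 * (1.0 + 0.8 * math.sin(2.0 * math.pi * t / 24.0))


if __name__ == "__main__":
    mean = expected_requests(daily_workload, 0.0, 0.1)
    print(mean, request_count_pmf(5, mean))
```

When the rate is constant, `expected_requests` returns the product of the rate and the interval length, and the probabilities reduce to (3.11).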


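For a doubly stochastic Poisson process, the probability density function (pdf) of the time to the first failure, given that the system is started at time $t = t_s$, is given by [Snyder 75]

$$ p(\tau \mid t_s) = E\left[ \lambda_\tau\, e^{-\int_{t_s}^{\tau} \lambda_u\, du} \right] \tag{3.14} $$

where the expectation is taken over the ensemble realizations of $\lambda_t$ on $[t_s, \tau]$. As shown in [Saleh 74] the above expectation is equal to

$$ p(\tau \mid t_s) = -\frac{d}{d\tau}\, E\left[ e^{-\Lambda_{[t_s,\tau]}} \right] \tag{3.15} $$

where

$$ \Lambda_{[t_s,\tau]} = \int_{t_s}^{\tau} \lambda_t\, dt \tag{3.16} $$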


The statistics of $\Lambda_{[t_1,t_2]}$ are therefore required. Note that the problem of determining the statistics of $\Lambda_{[t_1,t_2]}$ is equivalent to that of finding the distribution of the time that the server of an M/G/1 queue is busy [Kleinrock 75]. The value of $\Lambda_{[t_1,t_2]}$ can be expressed as

$$ \Lambda_{[t_1,t_2]} = p_u (t_2 - t_1) + (p_k - p_u + p_s) \sum_{i=1}^{R_{[t_1,t_2]}} s_i \tag{3.17} $$

where $s_i$ is the duration needed to serve the i-th request to the kernel. $\Lambda_{[t_1,t_2]}$ is equal to a term that grows linearly with time plus a random sum of the random variables $s_i$. Intuitively, for large integration intervals, the expected number of terms in the interval will increase such that the distribution of $\Lambda_{[t_1,t_2]}$ should approach a Gaussian distribution. This sum reminds one of the central limit theorem, which roughly states that as the number of terms in a sum of independent and identically distributed random variables approaches infinity, the distribution of the sum approaches the Gaussian distribution. Here, though, the number of terms in a time interval is not fixed but is a random variable, and the successive summands may not be independent. However, as will be proved in the following Section, the central limit theorem also holds for $\Lambda_{[t_1,t_2]}$ (under some mild assumptions made precise below). This fact will permit us to use the Gaussian distribution to compute expectations of the type shown in (3.15).
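The Gaussian behavior of the integrated failure rate can be explored with a small Monte Carlo sketch. The code below is an illustration under simplifying assumptions that are not made in the text: the request rate is constant, and the service times are independent and exponentially distributed. All function names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate exponential arrivals in [0, mean]."""
    count, clock = 0, rng.expovariate(1.0)
    while clock <= mean:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def sample_integrated_rate(alpha, beta, nu, mean_service, t, rng):
    """One draw of alpha*t + beta * (sum of R service times), with R ~ Poisson(nu*t)."""
    requests = sample_poisson(nu * t, rng)
    busy = sum(rng.expovariate(1.0 / mean_service) for _ in range(requests))
    return alpha * t + beta * busy


def standardized_samples(alpha, beta, nu, mean_service, t, draws, seed=0):
    """Draws of the integrated rate, centered and scaled by the exact mean and std."""
    rng = random.Random(seed)
    mean = alpha * t + beta * nu * t * mean_service
    # Compound Poisson variance: nu*t*E[s^2], and E[s^2] = 2*mean_service**2 here.
    std = beta * math.sqrt(nu * t * 2.0 * mean_service ** 2)
    return [
        (sample_integrated_rate(alpha, beta, nu, mean_service, t, rng) - mean) / std
        for _ in range(draws)
    ]


if __name__ == "__main__":
    z = standardized_samples(1e-4, 1e-2, 5.0, 0.1, 40.0, 2000)
    # For a standard Gaussian, about 84% of the mass lies below 1.
    print(sum(v <= 1.0 for v in z) / len(z))
```

With an expected count of 200 requests the empirical distribution of the standardized samples is already close to the normalized Gaussian distribution, although a small residual skewness remains.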


A stronger limit theorem can also be proved for the distribution of $\Lambda_{[t_1,t_2]}$. The integral of the failure rate process converges, in fact, to a Wiener process. This result, proved in Section 3.2.3, will allow us to explain why the apparent hazard function of the failure process is a decreasing function of time and will permit us to compute its limiting value. Curiously, the rate at which $\Lambda_{[t_1,t_2]}$ converges to a Wiener process will be shown to be one of the parameters characterizing the reliability of such complex systems as Time Sharing computers.

Remark: If $\nu$ is a constant independent of time, all the parameters characterizing the underlying intensity process are constant and $\lambda_t$ will be stationary. Under this assumption, the failure process becomes a renewal process. Indeed, no repair takes place after either transients or software faults. Therefore, after each failure the system is restarted and starts operating as new.


3.2.2. The central limit theorem for a random sum of dependent variables

$\Lambda_{[t_1,t_2]}$ can be rewritten as

$$ \Lambda_{[t_1,t_2]} = \alpha (t_2 - t_1) + \beta \sum_{i=1}^{R_{[t_1,t_2]}} s_i \tag{3.18} $$

where $\alpha = p_u$ and $\beta = (p_k - p_u + p_s)$. $R_{[t_1,t_2]}$ is the number of requests to the kernel in the interval $[t_1, t_2]$ and it is assumed to be a Poisson distributed random variable with distribution given in (3.12). $s_i$ is the time required to satisfy the i-th request. The $s_i$ will be assumed to be identically distributed. It cannot be assumed, though, that they are mutually independent, since requests to the kernel close in time are likely to be related one way or another (e.g., only a process that has been recently activated can be deactivated). However, it is reasonable to assume that requests to the kernel separated by a long time are independent. Thus, the sequence $\{s_i\}$ will be assumed to be stationary and $\alpha$-mixing, two concepts that are defined as follows:

Definition 20: Given a sequence $\{s_i\}$ of random variables, the sequence is said to be $\alpha$-mixing if there exists a sequence $\{\alpha_n\}$ such that for each k,

$$ |P(A \cap B) - P(A)P(B)| \le \alpha_n \quad \text{and} \quad \alpha_n \to 0 \text{ as } n \to \infty \tag{3.19} $$

for all $A \in \sigma(s_1, \dots, s_k)$ and $B \in \sigma(s_{k+n}, s_{k+n+1}, \dots)$.

Definition 21: A sequence $\{s_i\}$ is said to be stationary if the distribution of the random vector $(s_i, s_{i+1}, \dots, s_{i+k})$ does not depend on i.

It is therefore assumed that $s_i$ and $s_{i+k}$ are approximately independent for large k and that the statistics of $\{s_i\}$ are independent of the time origin. Define now

$$ x_i = s_i - E[s_i] \tag{3.20} $$

and let

$$ S_k = \sum_{i=1}^{k} x_i \tag{3.21} $$

such that $E[S_k] = E[x_i] = 0$. Without loss of generality assume $t_1 = 0$, $t_2 = t$, $\alpha = \beta = 1$. The integral of the failure rate process can now be expressed as

$$ \Lambda_t = t + R_t E[s_i] + S_{R_t} \tag{3.22} $$

Let the following conditions be defined:

Condition 1: Convergence of $\{S_k\}$. If

$$ P\left\{ \frac{S_n}{n^{1/2}\sigma} \le \alpha \right\} \Rightarrow \Phi(\alpha) \tag{3.23} $$

the sequence $\{S_k\}$ is then said to satisfy the central limit theorem with norming factors $n^{1/2}\sigma$.

Condition 2: Uniform continuity of $\{S_k\}$. Given any small positive $\varepsilon$ and $\eta$, there is a large $n_0$ and a small positive $\delta$ such that if $n > n_0$ then

$$ P\left\{ \max_{|k-n| < \delta n} |S_n - S_k| < \varepsilon n^{1/2}\sigma \right\} > 1 - \eta \tag{3.24} $$

Condition 3: Convergence of $R_t$. Let $R_t$ be a sequence of integer valued random variables such that

$$ \operatorname{plim}_{t\to\infty} \frac{R_t}{t} = \nu \tag{3.25} $$

Anscombe has proved the following result [Anscombe 52]:

Theorem 22: Suppose that $\{S_k\}$ satisfies the central limit theorem with norming factors $n^{1/2}\sigma$ (Condition 1), that the convergence is uniform in probability (Condition 2) and that $R_t$ is an integer valued random variable satisfying Condition 3. Then

$$ P\left\{ \frac{S_{R_t}}{(t\nu)^{1/2}\sigma} \le \alpha \right\} \Rightarrow \Phi(\alpha) \tag{3.26} $$

That is, the central limit theorem also holds for a sequence in which the number of summands is a random variable, provided that Conditions 1 through 3 are satisfied.


A proof that $\{S_k\}$ satisfies the central limit theorem when the $x_i$ form a stationary, $\alpha$-mixing sequence can be found in [Billingsley 79], a fact that is stated precisely in the following theorem.

Theorem 23: Suppose that $\{x_i\}$ is a stationary, $\alpha$-mixing sequence with $\alpha_n = O(n^{-5})$, and that $E[x_i] = 0$ and $E[x_i^{12}] < \infty$. If

$$ S_k = \sum_{i=1}^{k} x_i \tag{3.27} $$

then $\{S_k\}$ satisfies the central limit theorem with norming factors $n^{1/2}\sigma$, where

$$ \frac{Var[S_n]}{n} \to \sigma^2 = E[x_1^2] + 2 \sum_{k=1}^{\infty} E[x_1 x_{1+k}] \tag{3.28} $$

and the series converges absolutely. If $\sigma > 0$, then

$$ P\left\{ \frac{S_n}{n^{1/2}\sigma} \le \alpha \right\} \Rightarrow \Phi(\alpha) \tag{3.29} $$

Thus, a stationary, $\alpha$-mixing sequence satisfies Condition 1. A Poisson distributed random variable obviously satisfies Condition 3 (provided that $\nu(t)$ is bounded). To use Anscombe's theorem it is necessary to verify that a stationary, $\alpha$-mixing sequence is uniformly continuous in probability (Condition 2).
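The norming constant in (3.28) is the sum of the variance and twice all the lagged covariances of the sequence. The short sketch below evaluates a truncated version of that series. The first-order autoregressive covariance used as input is a hypothetical example of a dependent stationary sequence with summable covariances; it is not a model proposed in the text, and the function names are assumptions of this example.

```python
def ar1_autocovariance(phi, lag):
    """E[x_1 x_{1+lag}] for a zero-mean AR(1) sequence with unit innovation variance."""
    return phi ** lag / (1.0 - phi * phi)


def asymptotic_variance(autocov, max_lag):
    """Truncation of sigma^2 = E[x_1^2] + 2 * sum over k >= 1 of E[x_1 x_{1+k}]."""
    return autocov(0) + 2.0 * sum(autocov(k) for k in range(1, max_lag + 1))


if __name__ == "__main__":
    phi = 0.5
    sigma2 = asymptotic_variance(lambda k: ar1_autocovariance(phi, k), 200)
    # For this example the closed form is 1 / (1 - phi)**2.
    print(sigma2, 1.0 / (1.0 - phi) ** 2)
```

Positive correlation between nearby summands makes the limiting variance larger than the variance of a single term, which is why the dependence between successive service times cannot simply be ignored.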


Lemma 24: Suppose $\{x_i\}$ to be a stationary, $\alpha$-mixing sequence with $\alpha_n = O(n^{-5})$ and let $\{S_k\}$ be the sequence defined in (3.27). Then, given any small $\varepsilon$ and $\eta$ there exists a small positive $\delta$ and a large $n_0$ such that if $n > n_0$

$$ P\left\{ \max_{|k-n| < \delta n} |S_n - S_k| < \varepsilon n^{1/2}\sigma \right\} > 1 - \eta \tag{3.30} $$

proof: It must be shown that given $\varepsilon, \eta > 0$ there is a $\delta > 0$ such that for $n > n_0$

$$ P\{ |S_n - S_k| \ge \varepsilon n^{1/2}\sigma \} < \eta \quad \text{for any } k \text{ such that } |k - n| < \delta n \tag{3.31} $$

In particular, it will be shown that

$$ P\{ |S_n - S_k| \ge \varepsilon n^{1/2}\sigma \} \quad \text{for any } k \text{ such that } n < k < (1+\delta)n \tag{3.32} $$

can be made arbitrarily small. A similar argument would apply to values of k where $(1-\delta)n < k < n$. By Tchebyschev's inequality,


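$$ P\left\{ \Big| \sum_{j=n+1}^{k} x_j \Big| \ge \varepsilon n^{1/2}\sigma \right\} \le \frac{Var\left[ \sum_{j=n+1}^{k} x_j \right]}{\varepsilon^2 n \sigma^2} \tag{3.33} $$

$$ P\left\{ \Big| \sum_{j=n+1}^{(1+\delta)n} x_j \Big| \ge \varepsilon n^{1/2}\sigma \right\} \le \frac{Var\left[ \sum_{j=n+1}^{(1+\delta)n} x_j \right]}{\varepsilon^2 n \sigma^2} \tag{3.34} $$

where, by stationarity,

$$ Var\left[ \sum_{j=n+1}^{k} x_j \right] = (k-n) E[x_1^2] + 2 \sum_{i=1}^{k-n-1} (k-n-i)\, E[x_1 x_{1+i}] \tag{3.35} $$

$$ Var\left[ \sum_{j=n+1}^{(1+\delta)n} x_j \right] = \delta n\, E[x_1^2] + 2 \sum_{i=1}^{\delta n - 1} (\delta n - i)\, E[x_1 x_{1+i}] \tag{3.36} $$
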

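Thus,

$$ Var\left[ \sum_{j=n+1}^{(1+\delta)n} x_j \right] - Var\left[ \sum_{j=n+1}^{k} x_j \right] = (\delta n + n - k) E[x_1^2] + 2(\delta n + n - k) \sum_{i=1}^{k-n-1} E[x_1 x_{1+i}] + 2 \sum_{i=k-n}^{\delta n - 1} (\delta n - i)\, E[x_1 x_{1+i}] $$

Since $n < k < (1+\delta)n$, let $k'n = k$, $1 < k' < 1 + \delta$. From the properties of an $\alpha$-mixing sequence,

$$ \lim_{n\to\infty} \sum_{i=(k'-1)n}^{\delta n - 1} E[x_1 x_{1+i}] = 0 \tag{3.37} $$

and

$$ \lim_{n\to\infty} \sum_{i=(k'-1)n}^{\delta n - 1} (i/n)\, E[x_1 x_{1+i}] = 0 \tag{3.38} $$

so that

$$ \lim_{n\to\infty} \frac{ Var\left[ \sum_{j=n+1}^{(1+\delta)n} x_j \right] - Var\left[ \sum_{j=n+1}^{k} x_j \right] }{n} = (\delta + 1 - k') E[x_1^2] + 2(\delta + 1 - k') \sum_{i=1}^{\infty} E[x_1 x_{1+i}] \tag{3.39} $$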

The series converges absolutely for the same reason that $\sigma^2$ converges absolutely in (3.28) (see [Billingsley 79] for details). Thus, if $\sigma > 0$, the above limit is also positive. Therefore, there exists an $n_0$ such that

$$ Var\left[ \sum_{j=n+1}^{(1+\delta)n} x_j \right] \ge Var\left[ \sum_{j=n+1}^{k} x_j \right], \qquad n > n_0 \tag{3.40} $$

Hence, for $n > n_0$,

$$ P\left\{ \Big| \sum_{j=n+1}^{k} x_j \Big| \ge \varepsilon n^{1/2}\sigma \right\} \le \frac{Var\left[ \sum_{j=n+1}^{(1+\delta)n} x_j \right]}{\varepsilon^2 n \sigma^2} \tag{3.41} $$

But

$$ \frac{Var\left[ \sum_{j=n+1}^{(1+\delta)n} x_j \right]}{n} = \delta E[x_1^2] + 2 \sum_{i=1}^{\delta n - 1} (\delta - i/n)\, E[x_1 x_{1+i}] \tag{3.42} $$

which, given $n > n_0$, can be made arbitrarily small by a proper choice of $\delta$. In particular, choose $\delta$ such that

$$ \frac{Var\left[ \sum_{j=n+1}^{(1+\delta)n} x_j \right]}{n} < \eta\, \varepsilon^2 \sigma^2 \tag{3.43} $$

and (3.30) follows. ∎


It is now possible to prove the following theorem:

Theorem 25: Let $\{x_i\}$ be a stationary and $\alpha$-mixing sequence of random variables with $\alpha_n = O(n^{-5})$, and let $N_t$ be a sequence of Poisson distributed random variables satisfying Condition 3. Then,

$$ P\left\{ \frac{S_{N_t}}{(t\nu)^{1/2}\sigma} \le \alpha \right\} \Rightarrow \Phi(\alpha) \tag{3.44} $$

where $\sigma$ has been given in (3.28).

proof: Condition 1 holds by Theorem 23. Condition 2 holds by Lemma 24. Condition 3 holds for a Poisson distributed random variable if $\nu(t)$ is bounded. Therefore, by Theorem 22 the random sum converges in distribution to the Gaussian distribution. ∎


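Corollary 26: Let $N_{[t_1,t_2]}$ be a doubly stochastic Poisson process with intensity process $\lambda_t$ as defined in (3.10). If

$$ \operatorname{plim}_{t\to\infty} \frac{1}{t} \int_0^t \nu(\tau)\, d\tau = \nu \tag{3.45} $$

then the Probability Distribution Function (PDF) of the time to the first failure, given that the system is under observation since time $t_s$, is given by

$$ P(\tau \mid t_s) = 1 - \exp\left\{ -E[\Lambda_{[t_s,\tau]}] + \frac{Var[\Lambda_{[t_s,\tau]}]}{2} \right\} \tag{3.46} $$

where

$$ E[\Lambda_{[t_s,\tau]}] = \alpha(\tau - t_s) + \beta E[s_i]\, E[R_{[t_s,\tau]}] \tag{3.47} $$

$$ Var[\Lambda_{[t_s,\tau]}] = \beta^2 \left( E[R_{[t_s,\tau]}]\, \sigma^2 + E[s_i]^2\, Var[R_{[t_s,\tau]}] \right) \tag{3.48} $$

proof: Remember that

$$ \Lambda_{[t_1,t_2]} = \alpha(t_2 - t_1) + \beta \left( R_{[t_1,t_2]} E[s_i] + S_{R_{[t_1,t_2]}} \right) \tag{3.49} $$

where $x_i = s_i - E[s_i]$ and $S_k = \sum_{i=1}^{k} x_i$. For large $[t_1, t_2]$, by Theorem 25 $S_{R_{[t_1,t_2]}}$ converges in distribution to a Gaussian random variable with zero mean and variance

$$ Var[S_{R_{[t_1,t_2]}}] = E[R_{[t_1,t_2]}]\, \sigma^2 \tag{3.50} $$

where $\sigma$ is given in (3.28). $R_{[t_1,t_2]}$ is a Poisson distributed random variable. Hence, for large $[t_1, t_2]$ it is also asymptotically normal with mean and variance

$$ E[R_{[t_1,t_2]}] = Var[R_{[t_1,t_2]}] = \int_{t_1}^{t_2} \nu(\tau)\, d\tau \tag{3.51} $$

Furthermore,

$$ E[R_{[t_1,t_2]}\, S_{R_{[t_1,t_2]}}] = \sum_{n=0}^{\infty} n\, P(R_{[t_1,t_2]} = n)\, E[S_n] = 0 \tag{3.52} $$

Therefore,

$$ E[e^{-\Lambda_{[t_1,t_2]}}] = e^{-\alpha(t_2 - t_1)}\; E[e^{-\beta E[s_i] R_{[t_1,t_2]}}]\; E[e^{-\beta S_{R_{[t_1,t_2]}}}] \tag{3.53} $$

Note that if z is a Gaussian distributed random variable with mean $m_z$ and variance $\sigma_z^2$,

$$ E\{e^{-z}\} = \frac{1}{(2\pi)^{1/2}\sigma_z} \int_{-\infty}^{\infty} e^{-z}\, e^{-\frac{(z - m_z)^2}{2\sigma_z^2}}\, dz \tag{3.54} $$

$$ = e^{-m_z + \sigma_z^2/2} \tag{3.55} $$

Hence,

$$ E[e^{-\beta E[s_i] R_{[t_1,t_2]}}] = \exp\left\{ -\beta E[R_{[t_1,t_2]}] E[s_i] + \frac{\beta^2 E[s_i]^2\, Var[R_{[t_1,t_2]}]}{2} \right\} \tag{3.56} $$

and

$$ E[e^{-\beta S_{R_{[t_1,t_2]}}}] = \exp\left\{ \frac{\beta^2 E[R_{[t_1,t_2]}]\, \sigma^2}{2} \right\} \tag{3.57} $$

and (3.46) follows.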
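The quality of the Gaussian approximation behind (3.46) can be checked in a special case where the expectation is known in closed form. The sketch below assumes, for illustration only, a constant request rate and independent, exponentially distributed service times with rate `mu`; under these assumptions the variance parameter of (3.28) reduces to the variance of a single service time. The function names and parameter values are hypothetical.

```python
import math


def exact_survival(alpha, beta, nu, mu, t):
    """E[exp(-Lambda)] for R ~ Poisson(nu*t) and i.i.d. exponential service times."""
    laplace = mu / (mu + beta)  # E[exp(-beta * s_i)] for an exponential s_i
    return math.exp(-alpha * t + nu * t * (laplace - 1.0))


def gaussian_survival(alpha, beta, nu, mu, t):
    """exp(-E[Lambda] + Var[Lambda]/2) with the moments of (3.47) and (3.48)."""
    mean_s, var_s = 1.0 / mu, 1.0 / mu ** 2
    mean_r = var_r = nu * t  # Poisson count: mean equals variance
    mean_lambda = alpha * t + beta * mean_s * mean_r
    var_lambda = beta ** 2 * (mean_r * var_s + mean_s ** 2 * var_r)
    return math.exp(-mean_lambda + 0.5 * var_lambda)


if __name__ == "__main__":
    args = (1e-4, 1e-3, 5.0, 10.0, 100.0)
    # The probability of a failure before t is one minus either quantity.
    print(1.0 - exact_survival(*args), 1.0 - gaussian_survival(*args))
```

The two expressions agree closely when the product of the rate difference and the mean service time is small, which is the regime in which only the first two moments of the random sum matter.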


The distribution of $\{S_k\}$ not only converges to the Gaussian distribution for a large (expected) number of summands, but $\{S_k\}$ also satisfies the so-called invariance principle, a concept for which some more elaborate mathematical tools are required and which is described in the following Section.

3.2.3. Convergence to Wiener measure

Let $C$ be the space of continuous functions on $[0,1]$ and let $\mathcal{C}$ be its $\sigma$-field of Borel sets. For each $\omega \in \Omega$ let $p(u) = p(u, \omega)$ be the function defined on $[0, \infty)$ by

$$ p(u) = S_{\lfloor u \rfloor} + (u - \lfloor u \rfloor)\, x_{\lfloor u \rfloor + 1} \tag{3.58} $$

For $n = 1, 2, \dots$ define $p_n(u) = p_n(u, \omega)$ for $0 \le u \le 1$ by

$$ p_n(u) = \frac{p(nu)}{n^{1/2}\sigma} \tag{3.59} $$

Thus, $p_n(\cdot)$ is that element of $C$ which is linear on each interval $[(k-1)/n,\ k/n]$ and satisfies

$$ p_n(k/n) = \frac{S_k}{n^{1/2}\sigma}, \qquad k \le n \tag{3.60} $$

Definition 27: If $P_n(A) = P\{p_n \in A\}$ for $A \in \mathcal{C}$, then we say that $\{x_i\}$ satisfies the invariance principle with norming factors $n^{1/2}\sigma$ if $P_n(A) \Rightarrow \mathcal{W}(A)$, where $\mathcal{W}(\cdot)$ denotes Wiener measure.

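Now, for integers c, v, n define $n_j = jn/c$, $j = 0, 1, \dots, c$, and $n_{j,u} = n\,(v(j-1) + u)/(cv)$, $j = 1, \dots, c$, $u = 0, 1, \dots, v$.

Definition 28: For any real numbers $\alpha_j$, $\beta_j$, let $E_{n,r}$ be the set in $\Omega$ where the relations

$$ \alpha_j \le \frac{S_i}{n^{1/2}\sigma} \le \beta_j \quad \text{if } n_{j-1} < i \le n_j \tag{3.61} $$

are satisfied for $i < r$ but not for $i = r$.

Define the following two conditions:

Condition 4: For any integer c

$$ \lim_{n\to\infty} P\left\{ S_{n_i} - S_{n_{i-1}} \le \alpha_i (n/c)^{1/2}\sigma,\ i = 1, \dots, c \right\} = \prod_{i=1}^{c} \Phi(\alpha_i) \tag{3.62} $$

Condition 5: For any integer c, any set $(\alpha_1, \dots, \alpha_c, \beta_1, \dots, \beta_c)$ and each $\varepsilon > 0$

$$ \lim_{v\to\infty} \limsup_{n\to\infty} \sum_{r=1}^{n} P\left( E_{n,r} \cap \{ |S_{r'} - S_r| \ge \varepsilon n^{1/2}\sigma \} \right) = 0 \tag{3.63} $$

where $r' = n_{j,u+1}$ is that integer such that $n_{j,u} < r \le n_{j,u+1}$.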

The conditions under which the invariance principle holds for sequences of dependent random variables are stated now. A proof of the following theorem can be found in [Billingsley 56].

Theorem 29: The invariance principle holds for the sequence $\{S_k\}$ if Conditions 4 and 5 are satisfied.

It will now be proved that the invariance principle holds for a stationary, $\alpha$-mixing sequence of random variables.

Theorem 30: Let $\{x_i\}$ be a sequence of stationary, $\alpha$-mixing random variables with $\alpha_n = O(n^{-5})$ and $E[x_i^{12}] < \infty$. The invariance principle holds for $\{S_k\}$ with norming factors $(n/c)^{1/2}\sigma$, where $\sigma$ has been defined in (3.28).


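proof: It must be proved that a stationary and $\alpha$-mixing sequence satisfies Conditions 4 and 5. Condition 4 will be proved first. By stationarity,

$$ P\left\{ S_{in/c} - S_{(i-1)n/c} \le \alpha_i (n/c)^{1/2}\sigma \right\} = P\left\{ \sum_{j=(i-1)n/c+1}^{in/c} x_j \le \alpha_i (n/c)^{1/2}\sigma \right\} \tag{3.64} $$

$$ = P\left\{ \sum_{j=1}^{n/c} x_j \le \alpha_i (n/c)^{1/2}\sigma \right\} \tag{3.65} $$

Therefore, by Theorem 23

$$ P\left\{ S_{in/c} - S_{(i-1)n/c} \le \alpha_i (n/c)^{1/2}\sigma \right\} \Rightarrow \Phi(\alpha_i) \tag{3.66} $$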

Define  now 

Cn  *  E[-(Sin/c-S(j.1jn/c)(Sjn/c-Sfl  1)n/c)  J 

(3.67) 

Also  by  stationarity,  assuming  j  >  i , 

(3.68) 

By  the  definition  of  an  a-mixing  sequence, 

..  rtpn/c  -l  p^~»(i-i  +  1)n/c  -i 

—  EL^~»i  =  1  X-,J  ^L-^-<(G-i)n/c)  +  1  xjJ  +  a(j.j)n/c 

(3.69) 

and 

c!fae[ir,jB[rK5:,j.s*w 

(3.70) 

Since $E[x_i] = 0$ and

$$ \lim_{n\to\infty} \alpha_{(j-i)n/c} = 0 \tag{3.71} $$

it follows that

$$ \lim_{n\to\infty} C_n^{ij} = 0, \qquad i \ne j \tag{3.72} $$

and 

»m  c;  .B[(X^)«] 

n-»oo  n  * 

(3.73) 

2 

a  CO* 

(3.74) 

Thus, as $n \to \infty$ the distribution of the random vector

$$ \frac{1}{(n/c)^{1/2}\sigma} \left( S_{n/c},\ S_{2n/c} - S_{n/c},\ \dots,\ S_n - S_{(c-1)n/c} \right) \tag{3.75} $$

approaches a multidimensional Gaussian distribution having as covariance matrix the identity matrix, and Condition 4 is satisfied. As for Condition 5, note that

p(  E„,n{  IVS,I  s  }  )  £  p(  E„,.n  {  |s,-S,  J  >  efi,/2o/2  }  ) 

As  for  the  second  term  in  the  right  hand  of  (3.76), 

P{  En,nlSr  +  m-Srl  >  )  -  pl  |Sr  +  m  Sr|  >  en^a/2  } 

=  p{  >  enu2o/2  } 

<  =  r  P  {  l*„l  >  en1/2ff/2m  } 

Hence, 

2Z".i  p{|S,»m  St|2en,,V2)  £  p{  |x,|  >  en,,2o/2m] 

. ~  '  Sm(2"L)2**_L_ E:„E{|x/*«) 

to  '  n1  +  «/2  '  1  L  "  J 

for any $\delta > 0$ by Tchebyshev's inequality. Choose now $\delta = 2$ and $m = O(n^{1/5})$ and

$$\lim_{n\to\infty} \sum_{r=1}^{n} P\{\, |S_{r+m} - S_r| > \epsilon n^{1/2}\sigma/2 \,\} = 0$$

And now for the first term in (3.76). By the properties of an $\alpha$-mixing sequence and since $E_{n,r}$ is defined in $\sigma(x_1,\ldots,x_r)$

p(  E„n  {  |s,-S„J  a  «n>/2»/2  }) 


(3.76) 

(3.77) 

(3.78) 

(3.79) 

(3.80) 

(3.81) 

(3.82) 


<e:„ 

P(En,)P{|S,S,.J 

>  cn1/2(r/2  }  +  « 

—  J  m 

(3.83) 

<  (  max 

r 

p{  M,  j  2  5  )  *  «„ 

(3.84) 

<  (  maw 

r 

4Var[|Sr.-Srtm|] 

)  +  a 

(3.85) 

e2ncr2 

/  +  am 

<lh 
—  2 

4mi2 

2 

(3.86) 

where  and  £2  are  bounded.  The  last  inequality  has  been  obtained  taking  into  account  that  r’-r  will 
be  at  most  n/cv  and  rewriting  the  variance  of  |Sr,-Sr  +  |  as  a  function  of  a2.  Therefore 

limsup  P\  Enr‘  1  t  |Sf.-S  I  >  en  <r/2  J  )  <— — —  (3.87) 

n-°0  t2Cp 

As $n \to \infty$, the second term on the right-hand side of (3.76) goes to 0 by (3.82). As $u \to \infty$ the first term also goes to 0 by (3.87), and (3.63) follows. ∎

Since the sequence $\{x_i\}$ satisfies the invariance principle it is now possible to use the following theorem, also due to [Billingsley 63]. Let $R_t$ be a sequence of integer valued random variables. For a realization of the sequence, $R_t(\omega)$, let $\xi_1(\omega), \xi_2(\omega), \ldots$ be the successive discontinuities of $R_t(\omega)$ as a function of t, so that $R_t = i$ if $\xi_i \le t < \xi_{i+1}$. Define now

$$R'_t = i + \frac{t - \xi_i}{\xi_{i+1} - \xi_i}, \qquad \xi_i \le t \le \xi_{i+1}$$  (3.88)

Thus, $R'_t$ is that function of t which is linear on each interval $[\xi_i, \xi_{i+1}]$ and agrees with $R_t$ at its jumps.
Define  now  q(t)  =  p(R’t),  where  p(  )  has  been  defined  in  (3.58),  and  qu(t)  =  q(ut)/(<'U)1/2a.  Define  a 


measure  on  C  by  QU(A)  =  P{qu€A}. 


Theorem  31 :  If  {Sj}  satisfies  the  invariance  principle  with  norming  factors  n1/2o  and 

{,  R  •  PT 

t^oo  sup*<‘  '  —  I  J  =  0  (3 

Then  {SR  }  satisfies  the  invariance  principle,  that  is,  Qu(A)=>W(A). 

It is now possible to prove the convergence of $\Lambda_{[t_1,t_2]}$ to a Wiener process.

Corollary 32: Let $\lambda_t$ be the failure rate process defined in (3.10). As the integration interval approaches infinity, the integrated rate $\Lambda_t$ converges in distribution to a Wiener process $W_t$ with

$$E[W_t] = (\alpha + \beta\nu E[s_i])\, t$$  (3.90)

$$\mathrm{Var}[W_t] = \beta^2 \nu\, (\sigma^2 + E[s_i]^2)\, t$$  (3.91)


Proof: The proof is identical to that of Corollary 26. Just note that $R_t$ also converges to a Wiener process independent of $S_{R_t}$. Further, note that (3.89) is satisfied since

(3.92)

(3.93)

which goes to zero as $t \to \infty$.

By the definition of Wiener measure,

$$P\left\{ \frac{S_{R_{ut}}}{u^{1/2}\sigma} < a \right\} \Rightarrow W[t, a) = \frac{1}{(2\pi t)^{1/2}} \int_{-\infty}^{a} e^{-x^2/2t}\, dx$$

Hence, the invariance principle implies the central limit theorem. However, the invariance principle is a much stronger limit as it also implies that $\Lambda_t$ has independent increments. That is, as u approaches infinity the random variables

$$S_{R_{u[t_{k-1},t_k]}} \quad \text{and} \quad S_{R_{u[t_k,t_{k+1}]}}$$

are independent, normally distributed random variables. This result could never be obtained from the central limit theorem.


3.3.  The  observable  process 

In  the  previous  section  the  PDF  of  the  time  to  failure  of  a  computing  system  has  been  characterized 
by  some  convergence  limits.  The  expressions  obtained  depend  on  some  statistical  properties  of  the 
time  that  the  kernel  operates  in  kernel  mode.  In  particular,  they  depend  on  the  variance 

$$\sigma^2 = E[x_1^2] + 2\sum_{k=1}^{\infty} E[x_1 x_{1+k}]$$  (3.94)

where the $x_i$ are the service times of successive requests to the kernel. Unfortunately, the measurement of $E[x_1 x_{1+k}]$ is not likely to be possible on real systems. The kernel is executed at least
once  per  line  clock  tick,  60  clock  ticks  per  second.  To  estimate  the  above  statistic,  either  a  complex 
hardware  monitor  is  required  or  the  entire  kernel  software  has  to  be  modified  such  that  at  the  start 
and  end  of  each  service  a  time  stamp  is  recorded  somewhere.  Both  approaches  are  cumbersome  and 
impractical  for  operational,  commercially  available  computing  systems.  Since  one  of  the  premises  of 
the  present  work  is  that  any  mathematical  characterization  must  be  verifiable  from  easily  measurable 
variables  in  operating  computers,  an  alternate  way  is  required. 
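To make the structure of (3.94) concrete, the following Python sketch evaluates the sum for a sequence whose lag products $E[x_1 x_{1+k}]$ are known in closed form. The moving-average sequence $x_i = e_i + \theta e_{i-1}$ and the function names are illustrative assumptions, not part of the original text; the sequence is taken to be zero mean.

```python
def long_run_variance(autocov, max_lag):
    """sigma^2 = E[x_1^2] + 2 * sum_{k>=1} E[x_1 x_{1+k}], truncated at max_lag.

    autocov(k) must return E[x_1 x_{1+k}] for a zero-mean stationary sequence.
    """
    return autocov(0) + 2.0 * sum(autocov(k) for k in range(1, max_lag + 1))


def ma1_autocov(theta):
    """Lag products of the hypothetical sequence x_i = e_i + theta * e_{i-1},
    where the e_i are independent with zero mean and unit variance."""
    def autocov(k):
        if k == 0:
            return 1.0 + theta * theta
        if k == 1:
            return theta
        return 0.0  # terms more than one step apart are independent
    return autocov


sigma2 = long_run_variance(ma1_autocov(0.5), max_lag=10)
print(sigma2)  # (1 + theta)^2 = 2.25 for theta = 0.5
```

The example also shows why the cross terms matter: ignoring them would give 1.25 instead of 2.25, and it is exactly these cross terms that are hard to measure on an operational system.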


3.3.1. The observable intensity process

Let the process $\tilde{\lambda}_t$ be defined as follows:

$$\tilde{\lambda}_t = \frac{1}{w} \int_{t-w/2}^{t+w/2} \lambda_\tau \, d\tau$$

that is, $\tilde{\lambda}_t$ is the result of averaging $\lambda_t$ over an interval of duration w. The question now is whether

$$\int_{t_1}^{t_2} \tilde{\lambda}_\tau \, d\tau \approx \int_{t_1}^{t_2} \lambda_\tau \, d\tau$$  (3.95)

If the integral of $\lambda_t$ can be approximated by the integral of $\tilde{\lambda}_t$ the situation is much better. Most operating systems automatically measure the cumulative time in kernel mode, such that the values of $\tilde{\lambda}_t$ can be easily sampled. Fortunately, the answer is affirmative. The exact value of the integral of $\tilde{\lambda}_t$ is

$$\int_{t_1}^{t_2} \tilde{\lambda}_\tau \, d\tau = \frac{1}{w} \int_{t_1}^{t_2} \int_{t-w/2}^{t+w/2} \lambda_\tau \, d\tau \, dt$$  (3.96)

$$= \int_{t_1-w/2}^{t_2+w/2} W_{[t_1,t_2]}(\tau)\, \lambda_\tau \, d\tau$$  (3.97)

where $W_{[t_1,t_2]}(\tau)$ is the window function shown in Figure 3-2.

Figure 3-2: Window function used to obtain the integral of $\tilde{\lambda}_t$

That is,

$$\int_{t_1}^{t_2} \tilde{\lambda}_\tau \, d\tau = \int_{t_1}^{t_2} \lambda_\tau \, d\tau + e_w$$  (3.98)

The absolute value of the error term depends only on w. As the integration interval $[t_1,t_2]$ increases, the error term remains constant.
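The edge effect described above can be checked numerically. The sketch below is a discrete-time illustration under assumed values (a randomly switching two-level rate, a sampling step `dt`, and a window of about 10 time units), none of which come from the original text. It compares the integral of the window-averaged signal with the integral of the raw signal over a short and a long interval, and checks that both errors stay within one bound that depends only on the window width.

```python
import random


def window_average(signal, half_width):
    """Centered moving average over 2*half_width + 1 samples.

    out[j] is the average of the samples centered at index j + half_width.
    """
    prefix = [0.0]
    for value in signal:
        prefix.append(prefix[-1] + value)
    width = 2 * half_width + 1
    return [
        (prefix[k + half_width + 1] - prefix[k - half_width]) / width
        for k in range(half_width, len(signal) - half_width)
    ]


def integral_error(signal, half_width, start, stop, dt):
    """Integral of the averaged signal minus integral of the raw signal,
    both taken over sample indices [start, stop)."""
    averaged = window_average(signal, half_width)
    smooth = sum(averaged[k - half_width] for k in range(start, stop)) * dt
    raw = sum(signal[start:stop]) * dt
    return smooth - raw


rng = random.Random(0)
dt = 0.01
# Hypothetical two-level failure rate: 1.0 in "kernel mode", 0.2 otherwise.
signal = [1.0 if rng.random() < 0.3 else 0.2 for _ in range(200_000)]
half_width = 500  # window w = (2 * 500 + 1) * dt, about 10 time units

short_error = integral_error(signal, half_width, 1_000, 11_000, dt)
long_error = integral_error(signal, half_width, 1_000, 190_000, dt)
# Only samples within half a window of the two interval ends contribute,
# so the error is bounded by a quantity that depends on w alone.
bound = max(signal) * (half_width + 1) * dt
print(short_error, long_error, bound)
```

The interval length grows by a factor of almost 19 between the two calls, while the bound on the error does not change.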


Given a realization of $\lambda_t$, $\tilde{\lambda}_t$ is defined such that

$$\tilde{\Lambda}_{[t_1,t_2]} = \int_{t_1}^{t_2} \tilde{\lambda}_t \, dt$$  (3.99)

can be used as an approximation of $\Lambda_{[t_1,t_2]}$. Thus, the PDF of the time to failure can be approximated by

$$P(t_f < \tau \mid t_s) \approx 1 - E\{\, e^{-\tilde{\Lambda}_{[t_s,\tau]}} \,\}$$  (3.100)

Now let the value of the averaging window, w, be sufficiently large that the central limit theorem holds for $\tilde{\lambda}_t$. For fixed t, $\tilde{\lambda}_t$ can be approximated by a Gaussian random variable. This assumption is consistent. Let w = 10 sec. $\tilde{\lambda}_t$ is then equal to the sum of ~$10^3$ random variables. But typical MTTF values are on the order of, at least, hours. Hence, the evaluation of $\Lambda_t$ will be based on an integral over, say, 10 hours. The error term is equal to an integral over a period of 10 sec., and therefore can be neglected.


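$\tilde{\lambda}_t$ then becomes a Gaussian stochastic process, and $\tilde{\Lambda}_{[t_1,t_2]}$, being the integral of a Gaussian process over a finite interval, will obviously be a Gaussian random variable. If $\tilde{\lambda}_t$ is a Gaussian stochastic process with mean $E[\tilde{\lambda}_t]$ and autocorrelation function $R_{\tilde{\lambda}\tilde{\lambda}}(s,t)$, $\tilde{\Lambda}_{[t_1,t_2]}$ is a Gaussian random variable with mean

$$E[\tilde{\Lambda}_{[t_1,t_2]}] = \int_{t_1}^{t_2} E[\tilde{\lambda}_t] \, dt$$  (3.101)

and variance

$$\mathrm{Var}[\tilde{\Lambda}_{[t_1,t_2]}] = \int_{t_1}^{t_2} \int_{t_1}^{t_2} \left[ R_{\tilde{\lambda}\tilde{\lambda}}(s,t) - E[\tilde{\lambda}_s] E[\tilde{\lambda}_t] \right] ds \, dt$$  (3.102)

(see [Papoulis 65], pp. 323-325). Hence,

$$P(t_f < \tau \mid t_s) = 1 - \exp\left\{ -E[\tilde{\Lambda}_{[t_s,\tau]}] + \tfrac{1}{2} \mathrm{Var}[\tilde{\Lambda}_{[t_s,\tau]}] \right\}$$  (3.103)

The difference between (3.103) and (3.46) is that the values of (3.101) and (3.102) are much easier to estimate from an operational system than the values of (3.47) and (3.48). To estimate (3.101) and (3.102) all that is needed is a sequence of sample values of the fraction of time in kernel mode, and this is an easily observable sequence.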
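As a rough illustration of how the two moments combine into a failure probability, the Python sketch below integrates a mean function and a covariance function numerically and applies the Gaussian identity $E\{e^{-\Lambda}\} = \exp\{-E[\Lambda] + \mathrm{Var}[\Lambda]/2\}$. The function names, the midpoint rule, and the exponential covariance used in the example are assumptions chosen for illustration; they are not taken from the original text.

```python
import math


def integrated_rate_moments(mean_fn, cov_fn, t_start, t_end, n=200):
    """Midpoint-rule estimates of E[Lambda] and Var[Lambda] over [t_start, t_end].

    mean_fn(t) is the mean of the averaged failure rate and cov_fn(s, t) its
    covariance, i.e. the autocorrelation minus the product of the means.
    """
    dt = (t_end - t_start) / n
    mids = [t_start + (k + 0.5) * dt for k in range(n)]
    mean = sum(mean_fn(t) for t in mids) * dt
    var = sum(cov_fn(s, t) for s in mids for t in mids) * dt * dt
    return mean, var


def failure_cdf(mean_fn, cov_fn, t_start, t_end, n=200):
    """P(t_f < t_end | t_s = t_start) for a Gaussian integrated failure rate."""
    mean, var = integrated_rate_moments(mean_fn, cov_fn, t_start, t_end, n)
    return 1.0 - math.exp(-mean + 0.5 * var)


# Hypothetical inputs: unit mean rate and covariance exp(-|s - t|).
p = failure_cdf(lambda t: 1.0, lambda s, t: math.exp(-abs(s - t)), 0.0, 1.0)
print(p)
```

In practice `mean_fn` and `cov_fn` would be replaced by estimates computed from sampled values of the fraction of time in kernel mode.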


3.4.  The  equivalent  failure  process 

Expression (3.103) gives the PDF of the time to the first failure given that the system is observed starting at time $t_s$. Given $E[\tilde{\lambda}_t]$ and $R_{\tilde{\lambda}\tilde{\lambda}}(s,t)$, all the functions on which $P(t_f < \tau \mid t_s)$ depends are known and deterministic. Expression (3.103) can therefore be viewed as the PDF of a nonhomogeneous Poisson process.

Remark: The fact that the PDF of the failure process introduced in Sections 3.2 and 3.3 is equivalent to the PDF of a nonhomogeneous Poisson process with PDF given in (3.103) does not mean that the two processes are indistinguishable. It only means that the statistics of the time to the first failure are indistinguishable. However, if $\tilde{\lambda}_t$ is stationary such that the failure process is a renewal process, then the process with stochastic intensity and the nonhomogeneous Poisson process are truly equivalent.

A  nonhomogeneous  Poisson  process  is  a  much  simpler  conceptual  framework  to  work  with  than 
the  situation  described  in  the  previous  sections  of  this  chapter.  A  nonhomogeneous  Poisson  process 
is  completely  characterized  by  its  hazard  function,  a  deterministic,  time  varying  function.  Thus,  if 
reliability  characterization  can  be  made  based  only  on  the  distribution  of  the  time  to  failure,  the 
hazard  function  of  the  equivalent  failure  process  is  all  that  is  needed. 
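The statement that the hazard function is all that is needed can be illustrated with a few lines of Python. The sketch below uses the standard relation $P(t_f < \tau) = 1 - \exp\{-\int_0^\tau h(t)\,dt\}$ for the time to the first event of a nonhomogeneous Poisson process started at t = 0. The function name and the midpoint integration rule are illustrative choices, not part of the original text.

```python
import math


def cdf_from_hazard(hazard, tau, n=10_000):
    """P(t_f < tau) = 1 - exp(-integral_0^tau h(t) dt), by the midpoint rule."""
    dt = tau / n
    integral = sum(hazard((k + 0.5) * dt) for k in range(n)) * dt
    return 1.0 - math.exp(-integral)


# A constant hazard gives back the exponential distribution.
print(cdf_from_hazard(lambda t: 0.5, 2.0))
# A hypothetical linearly increasing hazard h(t) = t gives 1 - exp(-tau^2 / 2).
print(cdf_from_hazard(lambda t: t, 2.0))
```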


3.4.1. The hazard function

A nonhomogeneous Poisson process is completely specified by its hazard function. From (3.5) note that

$$h(t) = \frac{p(t)}{1 - P(t_f < t)}$$  (3.104)

Thus, from (3.103),

$$h(t) = \frac{\partial E[\Lambda_t]}{\partial t} - \frac{1}{2} \frac{\partial \mathrm{Var}[\Lambda_t]}{\partial t}$$  (3.105)


From (3.90), (3.91), and (3.105) it can be seen that for large integration intervals the hazard function becomes a constant independent of time. If the system has been started at t = 0, the quantity $h(t)\Delta t$ is the probability that a failure will occur in the interval $[t, t+\Delta t)$. Therefore, as the integrated failure rate converges to a Wiener process, the hazard function of the equivalent process converges to a constant independent of time. In that case, the equivalent failure process degenerates to a homogeneous Poisson process. For a homogeneous Poisson process, the numbers of points in disjoint time intervals are statistically independent. This in turn implies that the random variables

$$\Lambda_1 = \int_{t_1}^{t_2} \lambda_\tau \, d\tau \quad \text{and} \quad \Lambda_2 = \int_{t_2}^{t_3} \lambda_\tau \, d\tau$$

are statistically independent. Therefore, the convergence to a constant hazard function could not have been guessed from the central limit theorem alone since, in general, $\Lambda_1$ and $\Lambda_2$ will not be independent, being sums of an $\alpha$-mixing sequence.

3.5. Summary

This chapter started with the assumption that the failure rate of a Time Sharing computer is continuously switching between two states. While in each state, the system has a given sensitivity to the presence of hardware transient faults and software faults. The PDF of the time to failure depends on the integral of the failure rate. As the integration interval becomes much longer than the typical time between state transitions, the integrated failure rate converges first to a Gaussian distributed random variable and, for longer integration intervals, to a Wiener process.

This  description  has  been  completed  by  an  approximation  where  the  failure  rate  is  not  a  two  state 
process,  but  a  Gaussian  process  resulting  from  averaging  the  real  failure  rate  over  a  short  period  of 
time.  In  the  case  of  digital  computers  this  is  just  an  approximation,  but  in  other  systems,  a  Gaussian 
process  may  be  the  actual  failure  rate. 

It has then been shown how the statistics of the time to failure are completely determined by the expected value and variance of the process $\Lambda_t$, the integrated failure rate. Once these two moments are known, the failure process can be viewed as a nonhomogeneous Poisson process since all the functions on which the PDF depends are deterministic. The PDF and hazard function of the equivalent failure process are given in (3.103) and (3.105). Both depend only on the expected value and variance of the integrated failure rate since the system start time. These two moments depend in turn on the frequency with which requests arrive at the kernel, and on some statistics about how the requests are served. The following chapter specializes these results to two special cases of wide applicability.

Chapter 4

Specialization  to  systems  under 
constant  or  periodic  workload 

In  Chapter  3  the  emphasis  was  on  studying  the  more  general  properties  of  systems  characterized 
by  a  new  modeling  methodology.  This  chapter  derives  some  important  properties  for  systems  for 
which  some  more  information  is  available.  The  system  workload  will  be  assumed  now  to  be  either 
constant  or  periodic.  Nevertheless,  it  should  be  clear  that  periodicity  or  invariance  is  only  average 
behavior.  The  actual  failure  rate  is  still  assumed  to  be  a  stochastic  process. 

4.1.  Case  I  -  Constant  workload 

If the workload of the system is constant and the system is operating in steady state, it is reasonable to assume that the expected number of requests arriving at the kernel per unit time, $\nu(t)$, will be equal to a constant $\nu$. In this case, the probability of observing n requests in a time interval $(t_2 - t_1) = T$ is given by (3.11). Therefore, $\tilde{\lambda}_t$ becomes a stationary Gaussian process with mean

$$E[\tilde{\lambda}_t] = \frac{1}{w} \int_{t-w/2}^{t+w/2} E[\lambda_\tau] \, d\tau$$  (4.1)

$$= \alpha + \beta \nu E[s_i]$$  (4.2)

$$= \alpha + \beta m$$  (4.3)

where m is the average fraction of time in kernel mode. Define then,

$$E[\tilde{\lambda}_t] = q$$  (4.4)

and

$$E[\Lambda_t] = q t$$  (4.5)

Since $\tilde{\lambda}_t$ is stationary, its autocorrelation function $R_{\tilde{\lambda}\tilde{\lambda}}(s,t)$ depends only on the difference $\tau = |s-t|$ and, from (3.102),

$$\mathrm{Var}[\Lambda_t] = 2 \int_0^t (t-\tau) \left[ R_{\tilde{\lambda}\tilde{\lambda}}(\tau) - q^2 \right] d\tau$$  (4.6)

Let now

$$y_t = \tilde{\lambda}_t - q$$  (4.7)

where $y_t$ is a stationary, zero mean, Gaussian process, and

$$\mathrm{Var}[\Lambda_t] = 2 \int_0^t (t-\tau)\, R_{yy}(\tau) \, d\tau$$  (4.8)

The equivalent nonhomogeneous failure process has then the following important properties:

Corollary 1: Let $N_t$ be a failure process with failure rate process as defined in (3.10) with constant workload ($\nu(t) = \nu$) such that the failure rate can be approximated by

$$\tilde{\lambda}_t = q + y_t$$  (4.9)

where $y_t$ is a zero mean stationary Gaussian process. Define

$$W = \int_{-\infty}^{\infty} R_{yy}(\tau) \, d\tau$$  (4.10)


The statistics of the time to failure are then equal to the statistics of the time to failure of a nonhomogeneous Poisson process with hazard function h(t) such that

1. $h(0) = q$

2. $\lim_{t\to\infty} h(t) = q - W/2$, with $W \ge 0$

3. If $R_{yy}(\tau)$ is nonnegative everywhere, then h(t) is nonincreasing.

Proof: The hazard function of the equivalent process is given by (3.105). Substituting (4.6) and (4.5) in (3.105),

$$h(t) = q - \int_0^t R_{yy}(\tau) \, d\tau$$  (4.11)

and $h(0) = q$. For real processes, the autocorrelation is even and

$$\lim_{t\to\infty} h(t) = q - \frac{W}{2}$$  (4.12)

Note that $W \ge 0$ because if $S_{yy}(\omega)$ is the power spectrum of $y_t$,

$$S_{yy}(\omega) = \int_{-\infty}^{\infty} R_{yy}(\tau)\, e^{-i\omega\tau} \, d\tau$$  (4.13)

then $W = S_{yy}(0)$, which must be nonnegative for any physical process.

Finally, if the autocorrelation function is nonnegative its integral is nondecreasing and h(t) as given in (4.11) must be nonincreasing. ∎
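The three properties can be checked numerically for any candidate autocorrelation. The Python sketch below evaluates $h(t) = q - \int_0^t R_{yy}(\tau)\,d\tau$ by the trapezoidal rule for a triangular autocorrelation $R_{yy}(\tau) = \max(0, 1 - |\tau|)$, for which W = 1. The triangular shape, the value q = 1.5, and the function names are assumptions made for illustration only.

```python
def hazard(q, r_yy, t, n=2000):
    """h(t) = q - integral_0^t R_yy(tau) d tau, via the trapezoidal rule."""
    if t == 0:
        return q
    dt = t / n
    total = 0.5 * (r_yy(0.0) + r_yy(t))
    total += sum(r_yy(k * dt) for k in range(1, n))
    return q - total * dt


def triangular(tau):
    """Hypothetical nonnegative autocorrelation with integral W = 1."""
    return max(0.0, 1.0 - abs(tau))


q = 1.5
values = [hazard(q, triangular, 0.25 * k) for k in range(17)]
print(values[0])   # property 1: h(0) = q
print(values[-1])  # property 2: approaches q - W/2 = 1.0
```

Because the triangular autocorrelation is nonnegative, the computed values never increase from one grid point to the next, in line with property 3.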

4.1.1.  Examples 

A complete family of distributions can now be obtained for the case of constant workload but different autocorrelation functions. The only restrictions are that, being real processes, the autocorrelation functions must be even, positive definite, with a maximum at $\tau = 0$, and their integral over the real line must be nonnegative and bounded. The following examples illustrate some types of distributions that can be obtained under the assumption of constant workload.

4.1.1.1. Example 1. Exponentially decreasing hazard function - The doubly exponential distribution

If

$$R_{yy}(\tau) = \frac{W\beta}{2}\, e^{-\beta|\tau|}$$  (4.14)

then the PDF of the time to failure is given by

$$P(t_f < \tau) = 1 - \exp\left\{ -\left(q - \frac{W}{2}\right)\tau - \frac{W}{2\beta}\left(1 - e^{-\beta\tau}\right) \right\}$$  (4.15)

and its hazard function is

$$h(\tau) = q - \frac{W}{2}\left[ 1 - e^{-\beta\tau} \right]$$  (4.16)

$$h(\infty) = q - \frac{W}{2}$$  (4.17)

Note that, as for any nonhomogeneous Poisson process, the hazard function is the derivative of the exponent in the PDF. In particular, if $q = \beta = 1$, $W = 2$,

$$P(t_f < \tau) = 1 - e^{\,e^{-\tau} - 1}$$  (4.18)

$$h(t) = e^{-t}$$  (4.19)

which is the doubly exponential distribution, one of the three possible (maximum) extreme value distributions (given that $t_f$ must be nonnegative). Maximum extreme value distributions are obtained assuming that a system is formed by a collection of n identical modules operating in parallel. The system fails only when all the modules fail and the distribution of the time to system failure becomes the distribution of the maximum time to failure for the n modules. As n approaches infinity, the distribution of the system time to failure converges in distribution either to an exponential, a Weibull distribution, or to the distribution given in (4.18) [Barlow 75].
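A short Python check of this example is given below. It assumes the exponentially decaying autocorrelation $R_{yy}(\tau) = (W\beta/2)\,e^{-\beta|\tau|}$, for which integrating (4.11) gives the hazard $q - (W/2)(1 - e^{-\beta t})$ and the distribution $1 - \exp\{-(q - W/2)\tau - (W/2\beta)(1 - e^{-\beta\tau})\}$. The function names and the numerical-derivative check are illustrative additions.

```python
import math


def doubly_exponential_cdf(q, w, beta, tau):
    """P(t_f < tau) for the exponentially decaying autocorrelation."""
    exponent = -(q - w / 2.0) * tau - (w / (2.0 * beta)) * (1.0 - math.exp(-beta * tau))
    return 1.0 - math.exp(exponent)


def doubly_exponential_hazard(q, w, beta, t):
    """h(t) = q - (W/2) * (1 - exp(-beta * t))."""
    return q - (w / 2.0) * (1.0 - math.exp(-beta * t))


def cumulative_hazard(q, w, beta, tau):
    """-ln(1 - P(t_f < tau)); its derivative should equal the hazard."""
    return -math.log(1.0 - doubly_exponential_cdf(q, w, beta, tau))


# Special case q = beta = 1, W = 2: P = 1 - exp(exp(-tau) - 1), h(t) = exp(-t).
print(doubly_exponential_cdf(1.0, 2.0, 1.0, 0.7), 1.0 - math.exp(math.exp(-0.7) - 1.0))

# Central difference of the cumulative hazard against the closed-form hazard.
step = 1e-5
slope = (cumulative_hazard(1.5, 2.0, 3.0, 0.7 + step)
         - cumulative_hazard(1.5, 2.0, 3.0, 0.7 - step)) / (2.0 * step)
print(slope, doubly_exponential_hazard(1.5, 2.0, 3.0, 0.7))
```

Note that in the special case the distribution tends to $1 - e^{-1}$ rather than to 1 as $\tau$ grows, because the hazard itself decays to zero.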


More  generally,  if 


R 


yy 


(4.20) 


then, 


P(tf<r)  =  1  ■  e 
h(t) 


(,E, k„  A  ),£!■.,  V-% 


K 


<.•21,2-11-.^ 
p  I 


(4.21) 

(4.22) 


This  last  distribution  is  commonly  used  in  nuclear  medicine  to  characterize  the  light  pulses  due  to  the 
absorption  of  gamma  radiation  emanating  from  radioactive  tracers.  The  hazard  function  reflects 
physiological  transport  phenomena  due  to  blood  flow  rate,  metabolic  exchange  rates,  and  lung 
ventilation  [Sheppard  62]. 


4.1.1.2. Example 2. The exponential distribution - white noise failure rate

If $\tilde{\lambda}_t$ is white noise, with

$$R_{yy}(\tau) = W\, \delta(\tau)$$  (4.23)

where $\delta(\tau)$ is the Dirac delta function, then

$$P(t_f < \tau) = 1 - e^{-(q - W/2)\tau}$$  (4.24)

$$h(t) = q - \frac{W}{2}$$  (4.25)

That is, the PDF degenerates into an exponential distribution. Note that its parameter is not equal to the mean failure rate (q) but to the mean failure rate minus the "power" of the process, W/2.

4.1.1.3. Example 3. Pareto distribution

Assume that

$$R_{yy}(\tau) = \frac{\alpha\beta^2}{(\beta|\tau| + 1)^2}$$  (4.26)

then

$$P(t_f < \tau) = 1 - \exp\{ -(q - \alpha\beta)\tau - \alpha \ln(\beta\tau + 1) \}$$  (4.27)

$$= 1 - \frac{e^{-(q - \alpha\beta)\tau}}{(\beta\tau + 1)^{\alpha}}$$  (4.28)

In particular, if $q = \alpha\beta$,

$$P(t_f < \tau) = 1 - \frac{1}{(\beta\tau + 1)^{\alpha}}$$  (4.29)

$$h(t) = \frac{\alpha\beta}{\beta t + 1}$$  (4.30)

which is the Pareto distribution. The Pareto distribution is used to characterize clinical data relating to the probability of survival of individuals belonging to some populations. For instance, the Pareto distribution is postulated as the best distribution characterizing the probability that a patient waiting for a heart transplant (because of unavailable donors or other reasons) will die before receiving the heart transplant [Turnbull 74].

The choice of the Pareto distribution or distributions of the type (4.15) for analysis of survival data is common, and it is based mainly on heuristics. Hence, the present methodology justifies such choices whenever the actual failure rate can be approximated by a stationary Gaussian process with autocorrelation function given in (4.26).
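For simulation purposes, times to failure with the Pareto form $P(t_f < \tau) = 1 - (\beta\tau + 1)^{-\alpha}$ can be drawn by inverting the distribution function. The Python sketch below is a minimal illustration; the function names, the parameter values, and the seed are assumptions and are not part of the original analysis.

```python
import random


def pareto_cdf(alpha, beta, tau):
    """P(t_f < tau) = 1 - (beta * tau + 1)^(-alpha)."""
    return 1.0 - (beta * tau + 1.0) ** (-alpha)


def pareto_hazard(alpha, beta, t):
    """h(t) = alpha * beta / (beta * t + 1), a decreasing hazard."""
    return alpha * beta / (beta * t + 1.0)


def pareto_quantile(alpha, beta, u):
    """Inverse of pareto_cdf: the time tau with P(t_f < tau) = u, 0 <= u < 1."""
    return ((1.0 - u) ** (-1.0 / alpha) - 1.0) / beta


def sample_time_to_failure(alpha, beta, rng):
    """Draw one time to failure by inverse-transform sampling."""
    return pareto_quantile(alpha, beta, rng.random())


rng = random.Random(1)
samples = [sample_time_to_failure(2.0, 0.5, rng) for _ in range(20_000)]
median = pareto_quantile(2.0, 0.5, 0.5)
fraction_below = sum(s <= median for s in samples) / len(samples)
print(median, fraction_below)  # the fraction should be close to 0.5
```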


4.1.1.4. Example 4. An intensity process with infinite energy - The Weibull distribution

Consider now the following sequence of stochastic processes. $y^{(n)}_t$ is a stationary Gaussian process with mean

$$E[y^{(n)}_t] = \frac{\alpha\lambda}{(\lambda/n)^{1-\alpha}}$$  (4.31)

and autocorrelation function

$$R_{y^{(n)}y^{(n)}}(\tau) = \frac{\alpha(1-\alpha)\lambda^2}{(\lambda(|\tau| + 1/n))^{2-\alpha}}$$  (4.32)

where $\alpha < 1$. Then,

$$P_n(t_f < \tau) = 1 - \exp\{ -(\lambda(\tau + 1/n))^{\alpha} + (\lambda/n)^{\alpha} \}$$  (4.33)

$$h_n(t) = \frac{\alpha\lambda}{(\lambda(t + 1/n))^{1-\alpha}}$$  (4.34)

Now let n go to $\infty$ and

$$\lim_{n\to\infty} P_n(t_f < \tau) = 1 - e^{-(\lambda\tau)^{\alpha}}$$  (4.35)

$$\lim_{n\to\infty} h_n(t) = \frac{\alpha\lambda}{(\lambda t)^{1-\alpha}}$$  (4.36)

which is the Weibull distribution. Note that as $n \to \infty$ both the mean value and the variance of $y^{(n)}_t$ also go to $\infty$. Hence, the process

$$y_t = \lim_{n\to\infty} y^{(n)}_t$$  (4.37)

is not physically realizable, since it has infinite energy. However, the fact that a limiting distribution exists for $P_n(t_f < \tau)$ indicates that the Weibull distribution may be the right choice for characterizing doubly stochastic Poisson processes with intensity processes that have very large mean and variance.
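The limiting argument can be followed numerically. The Python sketch below evaluates the distribution $P_n(t_f<\tau) = 1 - \exp\{-(\lambda(\tau + 1/n))^{\alpha} + (\lambda/n)^{\alpha}\}$ and the hazard $h_n(t) = \alpha\lambda/(\lambda(t+1/n))^{1-\alpha}$ for increasing n and compares them with the Weibull limit. The parameter values and function names are illustrative assumptions.

```python
import math


def hazard_n(alpha, lam, n, t):
    """h_n(t) = alpha * lam / (lam * (t + 1/n))^(1 - alpha)."""
    return alpha * lam / (lam * (t + 1.0 / n)) ** (1.0 - alpha)


def cdf_n(alpha, lam, n, tau):
    """P_n(t_f < tau) = 1 - exp(-(lam * (tau + 1/n))^alpha + (lam / n)^alpha)."""
    return 1.0 - math.exp(-(lam * (tau + 1.0 / n)) ** alpha + (lam / n) ** alpha)


def weibull_cdf(alpha, lam, tau):
    """Limiting distribution 1 - exp(-(lam * tau)^alpha)."""
    return 1.0 - math.exp(-(lam * tau) ** alpha)


alpha, lam, tau = 0.5, 2.0, 1.0
gaps = [abs(weibull_cdf(alpha, lam, tau) - cdf_n(alpha, lam, n, tau))
        for n in (10, 1_000, 100_000)]
print(gaps)  # shrinking gap to the Weibull distribution

# The initial hazard, equal to the mean of the intensity process, grows without bound.
print([hazard_n(alpha, lam, n, 0.0) for n in (10, 1_000, 100_000)])
```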


4.1.2.  Discussion 

These and other possible distributions are summarized in Table 4-1. Considering the failure rate of a system to be a stationary Gaussian process is therefore a unifying method for obtaining a complete family of distributions commonly used in reliability theory. Some more insight can be gained by careful examination of the similarities and differences between these distributions.

4.1.2.1. The distinctive property of white noise

The main difference between white noise and any other stationary Gaussian process is that of predictability. The best predictor (in a mean square sense) of $y_t$ based on $y_s$, $s < t$, is $E[y_t \mid y_s = \xi]$. In general, for a stationary Gaussian process,

$$E[y_t \mid y_s = \xi] = \xi\, \frac{R_{yy}(t-s)}{\sigma_y^2}$$  (4.38)

([Wong 79], p. 64). However, if $y_t$ is white noise,

$$E[y_t \mid y_s = \xi] = E[y_t] = 0 \quad \text{if } s \ne t$$  (4.39)

Future values of white noise are totally unpredictable no matter how much information has been accumulated about its past behavior. On the other hand, for a nonwhite Gaussian process there always exist constants $a_1, \ldots, a_n$ such that [Breiman 68]

$$E[y_{t_{n+1}} \mid y_{t_1} = \xi_1, \ldots, y_{t_n} = \xi_n] = \sum_{k=1}^{n} a_k \xi_k$$  (4.40)
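For two past observations the constants of the linear predictor can be written out explicitly. The Python sketch below solves the standard normal equations for a zero-mean stationary Gaussian process, which is the usual way such constants are obtained; this construction, the exponential autocorrelation used in the example, and the function names are assumptions introduced for illustration and are not taken from [Breiman 68].

```python
import math


def predict_from_one(r_yy, xi, lag):
    """Single-observation predictor: xi * R_yy(lag) / R_yy(0)."""
    return xi * r_yy(lag) / r_yy(0.0)


def predictor_coefficients(r_yy, t1, t2, t3):
    """Constants (a1, a2) with E[y_t3 | y_t1, y_t2] = a1 * y_t1 + a2 * y_t2.

    Solves the 2x2 normal equations by Cramer's rule:
        [R(0)      R(t2-t1)] [a1]   [R(t3-t1)]
        [R(t2-t1)  R(0)    ] [a2] = [R(t3-t2)]
    """
    r0 = r_yy(0.0)
    r12 = r_yy(t2 - t1)
    b1 = r_yy(t3 - t1)
    b2 = r_yy(t3 - t2)
    det = r0 * r0 - r12 * r12
    a1 = (b1 * r0 - r12 * b2) / det
    a2 = (r0 * b2 - r12 * b1) / det
    return a1, a2


def exponential_autocorrelation(tau):
    """Hypothetical autocorrelation exp(-|tau|)."""
    return math.exp(-abs(tau))


a1, a2 = predictor_coefficients(exponential_autocorrelation, 0.0, 1.0, 1.5)
print(a1, a2)
```

For this particular autocorrelation the older observation receives a weight of essentially zero, so the prediction depends only on the most recent sample. A process with any nonzero correlation at the prediction lag yields a nonzero predictor, in contrast with white noise.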
4.1.2.2. The rate of convergence to a Wiener process

As for the meaning of having a hazard function whose asymptotic value is smaller than the mean failure rate, note that the function

$$f(t) = \int_0^t R_{yy}(\tau) \, d\tau$$  (4.41)

is in fact the rate at which $\Lambda_t$ converges to a Wiener process. In effect, note that

$$h(\infty) = q - \frac{W}{2}$$  (4.42)

is the hazard function that would be obtained if $y_t$ were white noise and $\Lambda_t$ were a Wiener process.

4.1.2.3. A different but equivalent conceptual framework

It is interesting to note how some of the distributions given in Table 4-1 can be obtained in a completely different way. Assume that the time to failure is exponentially distributed with parameter $\lambda$, but that $\lambda$ is a random variable. That is, once the system is started $\lambda$ is chosen at random from a known distribution and remains constant until the system fails. Every time that the system is restarted, a new value of $\lambda$ is randomly chosen. The PDF of the time to failure is in this case given by

$$P(t_f < \tau) = E\{ P(t_f < \tau \mid \lambda) \} = E\{ 1 - e^{-\lambda\tau} \}$$  (4.43)

where the expectation is taken with respect to the statistics of $\lambda$. It was first derived by [Harris 68] that, for instance, if $\lambda$ is Gamma distributed,

PXM  =  laxf  '  e  aX  (4.44) 

then $P(t_f < \tau)$ becomes the Pareto distribution. Similarly, other PDFs can be derived by assuming different distributions for $\lambda$.
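The Gamma case can be verified by direct numerical integration. The Python sketch below averages $1 - e^{-\lambda\tau}$ over a Gamma density and compares the result with a Pareto form. The shape/rate parametrization of the Gamma density, the integration limits, and the function names are assumptions chosen for this illustration and need not match the notation of (4.44); the sketch also assumes a shape of at least 1 so that the density is bounded near zero.

```python
import math


def gamma_pdf(x, shape, rate):
    """Gamma density with the given shape and rate (assumed parametrization)."""
    return rate ** shape * x ** (shape - 1.0) * math.exp(-rate * x) / math.gamma(shape)


def mixed_exponential_cdf(tau, shape, rate, upper=80.0, n=80_000):
    """E{1 - exp(-lambda * tau)} for lambda ~ Gamma(shape, rate), midpoint rule."""
    dx = upper / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * dx
        total += (1.0 - math.exp(-x * tau)) * gamma_pdf(x, shape, rate)
    return total * dx


def pareto_cdf(tau, shape, rate):
    """Closed form of the mixture: 1 - (1 + tau / rate)^(-shape)."""
    return 1.0 - (1.0 + tau / rate) ** (-shape)


numeric = mixed_exponential_cdf(1.5, shape=2.0, rate=1.0)
closed = pareto_cdf(1.5, shape=2.0, rate=1.0)
print(numeric, closed)
```

With shape $\alpha$ and rate $1/\beta$ the closed form is $1 - (\beta\tau + 1)^{-\alpha}$, the same Pareto form obtained from the stochastic-intensity argument.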

If the failure process is a renewal process, the following three types of systems have identical statistics:

•  Systems for which the failure process is a doubly stochastic Poisson process, $\lambda_t = q + y_t$, and $R_{yy}(\tau)$ leads to an equivalent hazard function h(t).

•  Systems with a random hazard function $\lambda$ such that $p_\lambda(x)$ leads to the same equivalent hazard function h(t).

•  Systems for which the failure process is a nonhomogeneous Poisson process with hazard function h(t).

Figure 4-1 illustrates the three types of systems. Which of the above three conceptual frameworks is more appropriate to work with will usually have to be decided on practical grounds. Probabilistically, the three types of systems are indistinguishable.


4.2.  Case  II  -  Periodic  workload  (the  Cyclostationary  process) 


Assume now that $\nu(t)$ is a periodic function of t, that is

$$\nu(t) = \nu(t + T)$$

and

x‘ =  “ +  ir/JEtsi]  R[tw/2,t+w/2] +  -irfi  \w/zt+w/2] 

If $\nu(t)$ varies slowly enough that it can be considered constant in any time interval $[t, t+w]$,

$$E[\tilde{\lambda}_t] = \alpha + \beta\, \nu(t)\, E[s_i]$$  (4.47)

$$= q(t)$$  (4.48)


(4.45) 


(4.46) 


where q(t) is also periodic with period T, and

$$E[\Lambda_{[t_1,t_2]}] = \int_{t_1}^{t_2} q(t) \, dt$$  (4.49)


Also, 


R«(s.t)  -  E[X9Xt] 


d(s)q(t)  +  e£  Sp  Sp  J 

V/2  M[t-W/2,t  +  W/2)  [s-W/2,t  + W/2] 


(4.50) 


where 


C*i  t— iOO  li  +  n  +  wWs) 

Sp  Sp  J  =  2^n  =  0  ^l^rstl =  1  *^j  =  i  +  n  (4-51) 

[t-W/2.t+W/2j  R[s-W/2.t  +  W/2)  ls,,J  *  1  1 

Note that $q(t) = q(t+T)$ and that $R_{\tilde{\lambda}\tilde{\lambda}}(s,t) = R_{\tilde{\lambda}\tilde{\lambda}}(s+T, t+T)$. Thus, $\tilde{\lambda}_t$ is a cyclostationary process. As has been remarked by [Gardner 78], if $N_{[t_1,t_2]}$ is a doubly stochastic Poisson process with cyclostationary intensity process, $N_{[t_1,t_2]}$ is itself cyclostationary, that is

EK,.y]  ’  <452> 

(the remark in [Gardner 78] is for the more general concept of processes which are almost cyclostationary in the wide sense). The fact that $N_{[t_1,t_2]}$ is itself cyclostationary explains the data reported by [Butner 80], where the number of system failures as a function of time of day reflects the average workload variations over a one day period. This is a consequence of having a cyclostationary intensity process and does not imply a strictly periodic failure rate.


Define now,

$$y_t = \tilde{\lambda}_t - q(t)$$

and note that


EOO  'wi>(t)  V“'i  +  n  +  w»(s) 

n*0  R(RfSit]  =  1  ^■Jjsi  +  n  ^[XjXjJ 


(4.53) 


(4.54) 


By the properties of an $\alpha$-mixing sequence, $E[x_i x_j]$ approaches zero as the difference $|i-j|$ increases. Thus, $R_{yy}(s,t)$ should approach zero as the difference $|s-t|$ increases. Further, note that

$$R_{yy}(t,t) = \sigma_y^2(t) = \nu(t)\, \frac{\beta^2}{w} \left[ E[s_i]^2 + \sigma^2 \right]$$  (4.55)

Thus, $R_{yy}(s,t)$ can be expressed as

$$R_{yy}(s,t) = \sigma_y(s)\, \sigma_y(t)\, \eta(|t-s|)$$  (4.56)

where $\eta(x)$ is a function of x such that $\eta(0) = 1$ and $\eta(\infty) = 0$. Therefore,

$$\mathrm{Var}[\Lambda_{[t_1,t_2]}] = \int_{t_1}^{t_2} \int_{t_1}^{t_2} \sigma_y(s)\, \sigma_y(t)\, \eta(|s-t|) \, ds \, dt$$  (4.57)

which  after  some  algebraic  manipulations  can  be  shown  to  be  equal  to 

Var[A[ti  y]  =  2y(t1.t2)j/',2ay(t)7,(|t2-t|)  dt  -  o^)  jT  2  2y(t, ,t)T,(t)  dt 


where 


ay(T)  dr 


(4.58) 


(4.59) 


(4.60) 


4.2.1.  Two  important  properties  of  the  cyclostationary  Poisson  process 


Although not as simple as the case of stationary intensity, closed form expressions for the hazard function and PDF of the time to failure for cyclostationary failure processes can be obtained. The only restriction is that now the hazard function and PDF are conditioned on the starting time.

Corollary 2: Let $N_{[t_1,t_2]}$ be a doubly stochastic Poisson process with cyclostationary intensity process $\tilde{\lambda}_t$ such that

$$\tilde{\lambda}_t = q(t) + y_t$$  (4.61)

$$R_{yy}(s,t) = \sigma_y(s)\, \sigma_y(t)\, \eta(|s-t|)$$  (4.62)

$$q(t) = q(t+T), \qquad \sigma_y(t) = \sigma_y(t+T)$$  (4.63)

The hazard function of the equivalent nonhomogeneous Poisson process given that the system is started at time $t_s$ is

$$h(t \mid t_s) = q(t) - \sigma_y(t) \int_{t_s}^{t} \sigma_y(\tau)\, \eta(|t-\tau|) \, d\tau$$  (4.64)

and its conditional PDF is

P(t,<r|ts)  *  1  •  exp  {  J  q(t)  dt .  2y(ts,r)  j  oy(t)  T,(|r-t|)  dt 

'3  f.  T  *3 

+  oy(r)jf  Iy(t3,t)  rj(t)  dt  }  (4.65) 

Proof: From (4.58), (3.101), and (3.103), after (substantial) algebraic manipulations. ∎
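The conditional hazard function $h(t \mid t_s) = q(t) - \sigma_y(t)\int_{t_s}^{t} \sigma_y(\tau)\,\eta(|t-\tau|)\,d\tau$ is straightforward to evaluate numerically. In the Python sketch below, the daily period, the sinusoidal shapes of $q(t)$ and $\sigma_y(t)$, the exponential $\eta$, and the function names are all assumptions made for illustration.

```python
import math


def conditional_hazard(q, sigma_y, eta, t, t_start, n=2000):
    """h(t | t_s) = q(t) - sigma_y(t) * integral_{t_s}^{t} sigma_y(u) eta(|t - u|) du,
    with the integral evaluated by the midpoint rule."""
    if t <= t_start:
        return q(t)
    du = (t - t_start) / n
    integral = 0.0
    for k in range(n):
        u = t_start + (k + 0.5) * du
        integral += sigma_y(u) * eta(abs(t - u))
    return q(t) - sigma_y(t) * integral * du


PERIOD = 24.0  # hypothetical daily workload cycle, in hours


def q_daily(t):
    """Assumed periodic mean failure rate."""
    return 2.0 + math.sin(2.0 * math.pi * t / PERIOD)


def sigma_daily(t):
    """Assumed periodic standard deviation of the intensity."""
    return 0.5 * (1.0 + 0.5 * math.sin(2.0 * math.pi * t / PERIOD))


def eta_exp(x):
    """Assumed normalized correlation: eta(0) = 1, eta(inf) = 0."""
    return math.exp(-x)


# The hazard depends on the start time, but shifting both the start time and
# the evaluation time by one period leaves it unchanged.
print(conditional_hazard(q_daily, sigma_daily, eta_exp, 10.0, 8.0))
print(conditional_hazard(q_daily, sigma_daily, eta_exp, 34.0, 32.0))
```

With constant $q$ and $\sigma_y$ the expression reduces to the stationary hazard $q - \sigma_y^2 \int_0^{t-t_s} \eta(x)\,dx$.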

The above hazard function and PDF are conditioned on the starting time $t_s$. To obtain the unconditional functions, the following expectations should be computed:

$$h(t) = \int_0^{\infty} p_{t_s}(u)\, h(t \mid t_s = u) \, du$$  (4.66)

$$P(t_f < \tau) = \int_0^{\infty} p_{t_s}(u)\, P(t_f < \tau \mid t_s = u) \, du$$  (4.67)

where $p_{t_s}(u)$ is the pdf of the starting time. The following theorem gives the value of this distribution. Its simplicity has very important practical implications.


Theorem 3: Let $N_{[t_1,t_2]}$ be a doubly stochastic Poisson process with intensity given in (4.61) through (4.63). Assume that the system is observed for n consecutive cycles and let $P_n(t_s < \tau)$ denote the PDF of the system start time during these n cycles. Then

$$P_n(t_s < \tau) \to P(t_s < \tau) = \frac{\int_0^{\tau} q(s) \, ds}{\int_0^{T} q(s) \, ds}, \qquad 0 \le \tau \le T$$  (4.68)

or, equivalently,

$$p_{t_s}(u) = \lim_{n\to\infty} p^{n}_{t_s}(u) = \frac{q(u)}{\int_0^{T} q(s) \, ds}$$  (4.69)

provided that

$$\lim_{n\to\infty} \frac{1}{n^2} \int_0^{nT} \int_0^{nT} \sigma_y(s)\, \sigma_y(t)\, \eta(|s-t|) \, ds \, dt = 0$$  (4.70)

proof:  Assume  that  nf  failures  have  been  observed  in  n  cycles.  Then, 
(u|'\.0<t<nT;N[0>nT]  =  nf)  *  7  '  - 1  •— »nf 


(4.70) 


(4.71) 


L 


K  dy 


Since  t,  *t  ,  the  above  density  is  also  the  pdf  of  t  ,  i  =  2,...,n  + 1.  Further,  note  that  the  t,  are 

7I  *i  +  1  ®i  *  ‘i 

mutually  independent  and  that  the  above  pdf  is  the  same  for  any  value  of  nf>0.  Thus, 

P"  (u|y,,0<t<nT)  -  — ^ - q(u) - - - yu 

S> 


P  nT  » nT  p  m  j»m 

J  q(s)  ds  +  ys  ds  j £  q(s)  ds  +  ^  y3 


nT  .nT 

q(s)  ds  +  /  yds 


(4.72) 


Since  q(t)  =q(t  +  T) 


n  q(u’) 


where 


P>|y,,0<t^nT) - — -  - 

’  n  J  q(s)  ds  +  ys  ds  n  jf  q(s)  ds  +  J  ys  ds 

U’  a  U  -  lu/nTj 


(4-73) 


(4.74) 


Hence, 


p  (u)  =  lim  E  {  — - 

s,  n-*oo  f  ' 


q(u') 


L 


Ef-rr 


q(s)ds  +  (1/n)  /  y  ds 


■a 


yu/n 


Now,  if 


L 


q(s)ds  +  (1/n)  /  y  ds 


A 


Sn  * 


(4.75) 


(4.76) 


$s_n$ is a Gaussian random variable with

$$E[s_n] = 0$$

$$\mathrm{Var}[s_n] = \frac{1}{n^2} \int_0^{nT} \int_0^{nT} \sigma_y(s)\, \sigma_y(t)\, \eta(|s-t|) \, ds \, dt$$

Therefore, if (4.70) holds, $s_n = 0$ with probability one as $n \to \infty$ and


lim  E  { 

rt~»oo 


q(u') 


£ 


T 

q(s)  ds  +  (1/n) 


£ 


nT 

ysds 


q(u’) 


q(s)  ds 


(4.77) 

(4.78) 


(4.79) 


As  for  the  second  term  in  (4.75),  note  that 

"5"  ElynSJ  “  ~r  Jo 


(4.80) 


Since  yn  is  a  physical  process,  the  integral  in  (4.80)  remains  bounded  as  the  upper  limit  goes  to  oo. 
Therefore  yn  and  sn  become  uncorrelated  and  independent  (both  are  Gaussian  random  variables)  as 
n-*oo.  Thus, 


lim  E  {  — - 

n-*oo  r  1 

JO 


y  /n 

'u 


q(s)  ds  +  (1  /n)  /  y  ds 


£ 


nT 


-}  =0 


(4.81) 


which  completes  the  proof. 
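As a numerical illustration of Theorem 3, the limiting start-time density is simply the periodic component $q(u)$ normalized over one cycle. The sketch below evaluates it on a midpoint grid. The function names, the grid size, and the sinusoidal profile `q_example` are hypothetical choices made for this example and are not taken from the measured system.

```python
import math

def limiting_start_density(q, period, n_bins=1440):
    """Evaluate p(u) = q(u) / integral_0^T q(s) ds on a midpoint grid over one cycle."""
    dt = period / n_bins
    times = [(i + 0.5) * dt for i in range(n_bins)]
    values = [q(t) for t in times]
    if min(values) < 0.0:
        raise ValueError("q(t) must be non-negative to act as a mean failure rate")
    total = sum(values) * dt  # midpoint-rule estimate of the integral of q over one cycle
    return times, [v / total for v in values]

def q_example(t):
    """Hypothetical daily profile (failures per hour), peaking six hours into the cycle."""
    return 0.10 + 0.05 * math.sin(2.0 * math.pi * t / 24.0)

times, density = limiting_start_density(q_example, period=24.0)
peak_time = times[density.index(max(density))]  # time of day with the highest density
```

The returned density integrates to one over the cycle, so it can be compared directly with a normalized time-of-day histogram of observed start times.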


4.3.  Summary 

The analysis of systems under constant average workload has led to a complete family of distributions commonly used in reliability theory. What distinguishes the different distributions is the autocorrelation function of the intensity process. The fact that all distributions have limiting hazard function values smaller than the average failure rate is of particular importance. The limiting hazard function value is the value that would be obtained if the failure rate were white noise. The rate of convergence of the integrated failure rate to a Wiener process has been shown to be the integral of the autocorrelation function.

It is important to note that the rate of convergence to a Wiener process is one of the parameters characterizing the reliability of the system under study. Consider two identical systems, A and B, such that the failure rate of system A is white noise, while the failure rate of system B is some other Gaussian process. Although both systems can do the same amount of work in the same time (in the sense that the expected value of the integrated failure rate is the same), system A is more reliable than system B. The integral of the failure rate for system A is a Wiener process no matter how short the integration interval is. Therefore, system A reaches the asymptotic (minimum) value of its hazard function instantly. This point will be elaborated later on in the thesis and will be illustrated with numerical examples.

The analysis of systems under periodic workload has not produced such concise results. However, an important property of cyclostationary failure processes is that, for systems operating after many cycles, the distribution of the system failure time over one cycle converges to the periodic component of the failure rate. This fact will lead in Chapter 7 to the establishment of cost functions on which cost-benefit analysis of fault-tolerance can be based.

Throughout the chapter it has been assumed that the failure rate can be approximated by a deterministic function of time plus a zero-mean Gaussian process,

$$\lambda_t = q(t) + y_t \tag{4.82}$$

However, from Chapter 3 it is known that

$$\lambda_t = p_u + (p_k - p_u + p_s)\,(m(t) + x_t) \tag{4.83}$$

where $m(t) + x_t$ is the fraction of time in kernel mode in the interval $[t-W/2,\,t+W/2]$ and $p_u$, $p_k$, $p_s$ are the coefficients establishing the sensitivity of the system to different failures depending on the system state. Therefore, $\lambda_t$ can be rewritten as

$$\lambda_t = \kappa_1 + \kappa_2\,(m(t) + x_t) \tag{4.84}$$


Hence, the failure process can be characterized by a doubly stochastic Poisson process with intensity

$$\lambda_t = f(m(t), x_t, \kappa) \tag{4.85}$$

where $f()$ is an arbitrary function and $\kappa$ is a vector of coefficients. In all cases, the PDF of the time to failure can be expressed as

$$P(t_f<\tau) = 1 - \exp\left\{-\int_{t_s}^{\tau} h(t)\,dt\right\} \tag{4.86}$$

where $h(t)$ is the hazard function of the equivalent nonhomogeneous process.

Chapter 5

Failure  process  analysis  of  a  real  system 

5.1 .  System  characteristics  and  measuring  tools 

In order to verify that the model described in Chapter 3 leads to a better fit to failure processes than previous work, an experiment was designed. Data was acquired for both the failure processes and the load of a general purpose time sharing system. The system chosen was the CMU-A, a PDP-10 used by the Computer Science Department at Carnegie-Mellon University as its main general purpose computational system. The system consists of a KL-10 processor, one megaword of memory, eight disk drives totaling 1600 megabytes of online storage, and two magnetic tape drives. The system runs a slightly modified version of the standard TOPS-10 operating system [Bell 78].

The  software  packages  used  to  instrument  the  experiment  are  illustrated  in  Figure  5-1.  Information 
about  failures  is  obtained  from  an  online  error  log  file  maintained  by  a  system  program,  which  records 
the  information  produced  by  different  error  formatting  routines.  Entries  are  made  to  this  file  for  each 
hardware  error  detected  in  the  system,  for  system  reloads,  for  disk  performance  statistics,  and  so  on 
[Digital  78].  The  error  log  is  later  processed  by  SEADS,  a  FORTRAN  package  which  lists  the  times  of 
detection  of  errors  associated  with  a  particular  resource.  In  order  to  obtain  accurate  information 
about  the  use  of  the  system,  a  special  SAIL  program,  SYSMON,  was  written  that  periodically  samples 
the  values  of  30  system  parameters.  The  files  generated  by  SYSMON  are  later  processed  by  another 
SAIL  package,  READSY,  which  computes  the  periodic  component  and  autocorrelation  function  of  the 
utilization  function  of  a  particular  system  resource.  The  information  generated  by  SEADS  and 
READSY  is  then  processed  by  an  APL  package  (POWELL)  which  estimates  the  maximum  likelihood 
parameters  of  the  pdf  of  the  time  to  failure  of  a  particular  resource.  Finally,  in  a  separate  SAIL 
package,  C2TST,  the  values  predicted  by  the  cyclostationary  model  and  other  models  described  in 
Section 4 are compared with the information stored in the error log according to a $\chi^2$ goodness-of-fit test.

The value of the accumulated time spent in kernel mode is obtained by executing a Monitor Call and includes the time spent in clock queue processing, short command processing, swapping and scheduling decisions, and software context switching [Digital 77]. This value does not include Monitor Call execution nor I/O interrupt times. The sampled value is not exactly the time that the system is executing in kernel mode, but it is close enough for our purposes.


5.2.  Model  parameterization 

According to the results presented in Chapter 3, the failure process of a Time Sharing computing system can be characterized by a doubly stochastic Poisson process with intensity process

$$\lambda_t = f(m(t), x_t, \kappa) \tag{5.1}$$

where

$$k_t = m(t) + x_t \tag{5.2}$$

is the average fraction of time spent in kernel mode in an interval of duration $W$ centered at time $t$ and $\kappa$ is a vector of parameters. In order to parameterize our model, the values of $\lambda_t$ must be sampled from real systems, and from these samples $m(t)$ and the autocorrelation function of $x_t$ must be estimated. Further, methods for estimating the maximum likelihood values of $\kappa$ must also be provided.


5.2.1 .  Sampling  the  intensity  process 


The operating system automatically measures the cumulative time spent in kernel mode. That means that the value of

$$K_t = \int_{t_s}^{t} k_\tau\,d\tau \tag{5.3}$$

can be easily sampled. If the values of $K_t$ are sampled at times $\{t_n - W/2,\ t_n + W/2,\ t_{n+1} - W/2,\ \dots\}$, samples of the observable intensity process are immediately available as

$$k_{t_n} = K_{t_n + W/2} - K_{t_n - W/2} \tag{5.4}$$

where $t_n = t_s + n\,\Delta t$ and $t_s$ is the system start time.
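The differencing step in (5.4) can be sketched as follows. The function name and the input layout (one pair of cumulative readings per sampling instant) are assumptions made for this example. The division by the window length is also an assumption, added so that each sample is a fraction of time between 0 and 1; (5.4) as printed shows only the raw difference.

```python
def kernel_fraction_samples(readings, window):
    """Turn pairs of cumulative kernel-time readings into per-window kernel fractions.

    readings: list of (K at t_n - W/2, K at t_n + W/2) pairs, in seconds.
    window:   window length W, in seconds.
    """
    samples = []
    for k_start, k_end in readings:
        if k_end < k_start:
            raise ValueError("cumulative kernel time cannot decrease")
        # Difference of the cumulative counter over the window, normalized by W (assumed).
        samples.append((k_end - k_start) / window)
    return samples

# Hypothetical readings: two one-second windows taken five minutes apart.
example_samples = kernel_fraction_samples([(10.0, 10.25), (300.0, 300.5)], window=1.0)
```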


Figure 5-1: Software packages used in the validation of the cyclostationary modeling methodology.

5.2.2. Estimating the deterministic component

The expected value of $k_t$, $m(t)$, is a deterministic function of time with period $T = 24$ hours. Thus, $m(t)$ can easily be estimated from the samples $k_{t_n}$. If data has been collected for $N$ days, let

$$m'(t_n) = \frac{1}{N}\sum_{i=0}^{N-1} k_{t_n + iT} \tag{5.5}$$

$m(t)$ will then be approximated by a finite Fourier series expansion with $M$ harmonics, that is,

$$m(t) = \bar{m} + \sum_{n=1}^{M} c_n \sin(n\omega t + \varphi_n) \tag{5.6}$$

where the following constants have been used

$$\omega = \frac{2\pi}{T} \qquad \bar{m} = \frac{1}{T}\int_0^{T} m'(t)\,dt \tag{5.7}$$

$$c_n = (a_n^2 + b_n^2)^{1/2} \qquad \varphi_n = \arctan\frac{a_n}{b_n} \tag{5.8}$$

$$a_n = \frac{2}{T}\int_0^{T} m'(t)\cos(n\omega t)\,dt \qquad b_n = \frac{2}{T}\int_0^{T} m'(t)\sin(n\omega t)\,dt \tag{5.9}$$

5.2.3. Autocorrelation function estimation


Given an ergodic and stationary process $z_t$, the problem is to estimate the function

$$R_{zz}(\tau) = \lim_{T\to\infty}\frac{1}{T}\int_0^{T} z_{t+\tau}\,z_t\,dt \tag{5.10}$$

For a finite record of observed values $z_{t_n}$, $n = 1,\dots,N$, the autocorrelation function is usually estimated using the expression

$$R'_{zz}(n\,\Delta t) = \frac{1}{N}\sum_{i=1}^{N-n} z_{t_i}\,z_{t_{i+n}} \tag{5.11}$$

This estimate is intuitive except for the factor $1/N$. Since $N-n$ terms are summed, it seems that $1/(N-n)$ would be more exact. In fact (5.11) is a biased estimator of the real autocorrelation function. However, its expected error is smaller than the expected error that would be obtained using the (unbiased) estimator with factor $1/(N-n)$ [Jenkins 68].

In the cases presented in this thesis the values of $z_{t_n}$ are not directly observable. For the sampling of the fraction of time in kernel mode, what is measured is the average fraction of time in kernel mode during a period of duration $W$. The measured values are not the values of $z_{t_n}$, but the values of the process

$$z'_t = \frac{1}{W}\int_{t-W/2}^{t+W/2} z_\tau\,d\tau \tag{5.12}$$

As will be shown in the following sections, in the two cases studied in this thesis the autocorrelation function suggests an approximation of the form

$$R_{z'z'}(t) = a'_1 e^{-\beta_1|t|} + a'_2 e^{-\beta_2|t|} \tag{5.13}$$

The problem is then to estimate the values of the $a_i$, $\beta_i$ from the observed values of $z'_{t_n}$. If

$$R_{zz}(t) = a_1 e^{-\beta_1|t|} + a_2 e^{-\beta_2|t|} \tag{5.14}$$

it is easy to show that, for $|t| \ge W$,

$$R_{z'z'}(t) = a'_1 e^{-\beta_1|t|} + a'_2 e^{-\beta_2|t|} \tag{5.15}$$

where

$$a'_i = a_i\,\frac{2[\cosh(\beta_i W) - 1]}{(\beta_i W)^2} \tag{5.16}$$

The problem is then to estimate the values of the $a'_i$, $\beta_i$ using (5.11) and the observed values of $z'_{t_n}$, and use (5.16) to obtain the values of $a_i$ of the autocorrelation function of $z_t$.
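The correction between the coefficients of the window-averaged process and those of the underlying process can be checked numerically. The sketch below assumes that the measured process is the average of $z_t$ over a window of length $W$ (normalized by $1/W$) and that the lag is at least $W$; under those assumptions an exponential term $a e^{-\beta|t|}$ keeps its decay rate and its amplitude is scaled by $2[\cosh(\beta W) - 1]/(\beta W)^2$. All function names are hypothetical.

```python
import math

def averaged_coefficient(a, beta, window):
    """Amplitude a' seen after averaging over a window W: a * 2[cosh(bW) - 1] / (bW)^2."""
    x = beta * window
    return a * 2.0 * (math.cosh(x) - 1.0) / (x * x)

def underlying_coefficient(a_prime, beta, window):
    """Invert the scaling to recover the amplitude a of the unaveraged process."""
    x = beta * window
    return a_prime * (x * x) / (2.0 * (math.cosh(x) - 1.0))

def averaged_autocorrelation_numeric(a, beta, window, lag, n=200):
    """Midpoint-rule double integral of a*exp(-beta|lag + u - v|) over both windows, over W^2."""
    h = window / n
    total = 0.0
    for i in range(n):
        u = -window / 2.0 + (i + 0.5) * h
        for j in range(n):
            v = -window / 2.0 + (j + 0.5) * h
            total += a * math.exp(-beta * abs(lag + u - v))
    return total * h * h / (window * window)
```

For small $\beta W$ the scaling factor is close to one, so the correction only matters when the averaging window is comparable to the correlation time $1/\beta$.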


Unfortunately it will not always be possible to follow this procedure. The accuracy of the estimated autocorrelation function is limited basically by two factors: the sampling frequency and the length of the available record, N. Although many techniques exist for power spectrum estimation that take into account these two factors [Oppenheim 75] (the power spectrum is the Fourier transform of the autocorrelation function), no techniques are available for correcting the estimates of the autocorrelation function itself. If the sampling frequency is comparable to the bandwidth of the power spectrum, the power spectrum estimate may be poor due to aliasing. Under these conditions, the estimate of the autocorrelation function given by (5.11) may take negative values.
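The estimator of (5.11) and its unbiased variant differ only in the divisor, which the following sketch makes explicit. The function name and the `unbiased` switch are hypothetical, and the samples are assumed to have had their mean removed already.

```python
def autocorrelation(samples, max_lag, unbiased=False):
    """Estimate R(n) for n = 0..max_lag from a zero-mean record.

    With unbiased=False the sum of N - n products is divided by N, as in (5.11);
    with unbiased=True it is divided by N - n.
    """
    count = len(samples)
    if not 0 <= max_lag < count:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(samples)")
    estimates = []
    for lag in range(max_lag + 1):
        products = sum(samples[i] * samples[i + lag] for i in range(count - lag))
        divisor = (count - lag) if unbiased else count
        estimates.append(products / divisor)
    return estimates

record = [1.0, -1.0, 1.0, -1.0]           # hypothetical zero-mean record
biased = autocorrelation(record, 2)        # divisor N
unbiased = autocorrelation(record, 2, unbiased=True)  # divisor N - n
```

At large lags the unbiased form divides a sum of very few products by a small number, which is why its variance grows and the biased form is usually preferred.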
5.2.4. Maximum likelihood estimation of model coefficients

The general problem of parameter estimation for doubly stochastic Poisson processes can be stated as follows. Let $\{N(t);\ t>t_0\}$ be a doubly stochastic Poisson counting process with intensity $\lambda(t, x_t, \kappa)$, where $x_t$ is a stochastic process and $\kappa = (\kappa_1, \kappa_2, \dots, \kappa_m)$ is a vector of unknown parameters. The occurrence density function that a given realization of the process has a failure at time $t_f$ if it has been started at time $t_s$ is given by

$$p(t_f \mid \kappa, x_\tau,\ t_s<\tau<t_f) = \lambda(t_f, x_{t_f}, \kappa)\,\exp\left\{-\int_{t_s}^{t_f} \lambda(\tau, x_\tau, \kappa)\,d\tau\right\} \tag{5.17}$$

If $n$ failures are observed at times $t_{f_1},\dots,t_{f_n}$ with associated starting times $t_{s_1},\dots,t_{s_n}$, the probability density function of observing such a set of events is

$$p^{(n)}(t_{f_1},\dots,t_{f_n} \mid \kappa;\ x_\tau,\ t_{s_i}<\tau<t_{f_i},\ i=1,\dots,n) = \prod_{i=1}^{n} P(t_{s_i})\,\lambda(t_{f_i}, x_{t_{f_i}}, \kappa)\,\exp\left\{-\int_{t_{s_i}}^{t_{f_i}} \lambda(\tau, x_\tau, \kappa)\,d\tau\right\} \tag{5.18}$$

where $P(t_{s_i})$ is the a priori probability that the system is started at time $t_{s_i}$. Taking the expectation with respect to the statistics of $x_t$ we obtain

$$p^{(n)}(t_{f_1},\dots,t_{f_n} \mid \kappa;\ t_{s_1},\dots,t_{s_n}) = E\left\{ \prod_{i=1}^{n} P(t_{s_i})\,\lambda(t_{f_i}, x_{t_{f_i}}, \kappa)\,\exp\left\{-\int_{t_{s_i}}^{t_{f_i}} \lambda(\tau, x_\tau, \kappa)\,d\tau\right\} \right\} \tag{5.19}$$

The maximum likelihood estimate $\kappa' = (\kappa'_1, \kappa'_2, \dots, \kappa'_m)$ of $\kappa$ in terms of a particular realization of the process is by definition the value of $\kappa$ that maximizes the above density function [Melsa 78]. That is, $p^{(n)}(t_{f_1},\dots,t_{f_n} \mid \kappa;\ t_{s_i},\ i=1,\dots,n)$ will be maximum for $\kappa = \kappa'$. As has been shown in Chapter 3, the pdf of the time to failure can be written as

$$p(t_f) = h(t_f, \kappa)\,\exp\left\{-\int_{t_s}^{t_f} h(\tau, \kappa)\,d\tau\right\} \tag{5.20}$$

The function to be maximized is then

$$p^{(n)}(t_{f_1},\dots,t_{f_n}) = \prod_{i=1}^{n} P(t_{s_i})\,h(t_{f_i}, \kappa)\,\exp\left\{-\int_{t_{s_i}}^{t_{f_i}} h(\tau, \kappa)\,d\tau\right\} \tag{5.21}$$

Note that this problem is equivalent to minimizing the function

$$L(\kappa) = \sum_{i=1}^{n} \int_{t_{s_i}}^{t_{f_i}} h(\tau, \kappa)\,d\tau - \sum_{i=1}^{n} \ln[h(t_{f_i}, \kappa)] \tag{5.22}$$

subject to the constraints

$$h(t_{f_i}, \kappa) > 0 \qquad i = 1,\dots,n \tag{5.23}$$


Since closed form expressions for the components of $\kappa$ at the minimum are not generally available, this is a typical nonlinear programming problem, subject to nonlinear inequality constraints. Since this problem will have to be solved every time that the failure process of a resource has to be modeled for a real system, particular care has been taken in finding an efficient procedure for the location of minima of functions of the type (5.22). The algorithm used is a slightly modified version of a variable metric algorithm proposed by [Powell 78]. The original Powell algorithm occasionally requires the evaluation of the objective function outside the constraints and has been modified such that the maximum step size at each iteration never leads to a point outside the constraints. The algorithm has been implemented as an APL package that requires the definitions of the objective function, gradient, and constraints. Several objective functions corresponding to different distributions were given in [Castillo 80b].
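To make the shape of the objective concrete, the sketch below evaluates a function of the type (5.22) and minimizes it for the simplest possible case, a constant hazard, where the answer is known in closed form (number of failures divided by total operating time). It uses a golden-section search rather than the variable metric algorithm described above, so it is only an illustration of the objective, not of the APL package. The function names, the convention that the hazard depends on elapsed time since start, and the observation intervals are all hypothetical.

```python
import math

def neg_log_likelihood(kappa, intervals, hazard, cumulative_hazard):
    """Sum over intervals of H(elapsed, kappa) - ln h(elapsed, kappa)."""
    total = 0.0
    for t_start, t_fail in intervals:
        elapsed = t_fail - t_start
        h = hazard(elapsed, kappa)
        if h <= 0.0:
            return math.inf  # positivity constraint on the hazard is violated
        total += cumulative_hazard(elapsed, kappa) - math.log(h)
    return total

def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function of one variable on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0

# Hypothetical (start, failure) times in hours; total operating time is 60 hours.
intervals = [(0.0, 12.5), (12.5, 20.0), (20.0, 51.0), (51.0, 60.0)]
constant_hazard = lambda elapsed, kappa: kappa
constant_cumulative = lambda elapsed, kappa: kappa * elapsed

kappa_hat = golden_section_minimize(
    lambda k: neg_log_likelihood(k, intervals, constant_hazard, constant_cumulative),
    1e-6, 1.0)
```

For the decreasing hazard functions of this chapter the same objective applies with `hazard` and `cumulative_hazard` replaced by the corresponding expressions, at which point a multidimensional constrained method becomes necessary.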

5.2.5.  Error  correction 

The last practical consideration to be treated in this section deals with the approximation of $k_t$ as a Gaussian random variable. If $k_t$ is a Gaussian random variable with mean $m(t)$ and variance $\sigma^2$, there is a finite probability that $k_t<0$.


Note  that  the  integral 


is equal to the excess area under the peaks of $x_t$ below a threshold $k_{min} - m(t)$ (see Figure 5-2).


(5.29) 


Figure  5-2:  Relationship  between  x{  and  x®. 


It is shown in [Stratonovich 67] that if the duration of the peaks is much smaller than the time between peaks, the occurrence times of peaks above a threshold $c(t)$ can be approximated by a Poisson process with intensity

$$\eta(t) = \frac{(R_2)^{1/2}}{2\pi\sigma}\,e^{-c^2(t)/2\sigma^2} \tag{5.30}$$

where

$$R_2 = -\left.\frac{\partial^2 R_{xx}(t)}{\partial t^2}\right|_{t=0} \tag{5.31}$$

and $\sigma^2$ is the variance of $x_t$. Since the presence of peaks can be characterized as a Poisson process, the excess areas under the peaks can be viewed as a marked Poisson process (see [Snyder 75], Ch. 7) with nondenumerable mark space. Let $s_i$ denote the area under the i-th peak in the interval $[s,t]$. $s_i$ will be the mark associated with the i-th point (peak) in $[s,t]$.
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The  statistics  of  the  sum  of  the  areas  under  the  peaks  are  equivalent  to  the  statistics  of  the  mark- 
accumulator  process, 

f  *  -  m 

where  Nj(  tj  is  the  number  of  peaks  in  [t3,r].  The  above  expectation  can  therefore  be  rewritten  as 
.PI  31) 

L  X,dt}  =  P(Nf.  .  =  0)  +  E"«P(Nf,  ,,  =  n)E{e^i  =  lSiK  (5-33) 


([Snyder  75], pi  31) 
E{e, 

Note  that 
E 


VI ' 


{e^'=  1  r]  =  n}  =  E {E[e^'s  lSi  |  ,  =  n;tr  t2 . tN_  ]  |  r]  =  n  }  (5.34) 

S'  ' 

Given  Nr,  ,  =  n  ,  t,,...,tM  are  a  collection  of  independent,  identically  distributed  random  variables 

S,T  [t  ,r] 

( [Snyder  75],  p.  65),  the  common  distribution  being 


P«  M  -  -7i 


T?(X) 


f  rj(t)  dt 

K 


t  <X<T 

3 —  — 


Therefore,  if.the  areas  under  different  peaks  are  mutually  independent, 

e[ 

and,  if  the  i-th  peak  occurs  at  time  t , 

E  {e*1 }  =  jf  p,_(x)  E[eSti|t,  =  x]  dx 


f ,  li.iuy  ui  uo  u  i  iuui  ui  v  muiwuii; 

. . . . s,„]  ■  MeV> 


(5.35) 


(5.36) 


(5.37) 


ft  f  oo 

l  Wo  V/*1 trx)®Xdxdx 


where 


and 


o  1  ,3c  (r)  _  1/2  ,1/3 

3  3 a3 


c(t)  *  k^  -  m(t) 


(5.38) 


(5.39) 


(5.40) 


If  s,«1  for  all  i,  the  following  approximation  can  be  made 

*i 

E  { e*1 }  -  1  +  £  p,  (xJEts^lt.  *  x]dx 


(5.41) 


and 
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E[S, 


^%]~L 


co 


XP.  (X)  dx 


(4=-),/a 


c2(x)  R2 

Define  then 

ECe*1}  -  1  +  nta,r) 

where 

o3( 2w/RJ1/2  (' 

i  —  dT 


/• 


(T)/2  dr  *s  C  (T) 


Therefore, 


dt}  =  P(N[t  T,-0)  +  nr=i  P(Nj,  i  =»  n)  [1  +  f(ts,T)]n 


and,  since  Njt  tj  is  a  Poisson  process  with  intensity  ij(t), 


r  r<?«i  v“  CffW’ ./**• 

•le-*8  j  -  o - - - fjj-3 -  e  -*s 


n(t)dt 


Note  that  if  c^r)  ■  c  independent  of  t  and  x(  is  stationary, 
E{ej(  **}  -  .'fcXTV 


where 


p(c) 


<r3(w/2)1/s 


(5.42) 

(5.43) 

(5.44) 

(5.45) 

(5.46) 

(5.47) 

(5.48) 
5.3. Characterization of the time to System Failure

In this section the modeling methodology presented in Chapter 3 will be applied to characterize the reliability of the system described in Section 5.1. All the necessary techniques to parameterize the model have been given in Section 5.2. An exact characterization will be developed in Section 5.3.1, where the periodicity of the workload is taken into account. In Section 5.3.2 the model is simplified, giving the characterization that would result from assuming a constant workload, and therefore a stationary intensity process. Figures 5-3 through 5-7 summarize the observed behavior of the CMU-A.

Figure 5-3 gives the actual values of the average fraction of time in kernel mode, $k_t$, averaged over one second and sampled every five minutes for five consecutive weekdays. The periodicity of the mean is clear from this figure. As a further indication that $k_t$ can be approximated by a cyclostationary process, Figure 5-4 shows the estimated autocorrelation function of $k_t$, $R_{kk}(\tau)$, according to equation (5.11). $R_{kk}(\tau)$ is clearly periodic with a period of 24 hours. The estimated autocorrelation function was obtained from a record of samples $k_{t_n}$ covering 60 days of normal system operation.

Figure 5-5 shows the estimated average fraction of time in kernel mode, $m'(t)$, and its Fourier series expansion, $m(t)$, obtained as described in Section 5.2.2. Figure 5-6 shows the histogram of system failures as a function of time of day. To study the properties of the stochastic component, $x_t$, a plot of the variance $\sigma_x^2(t)$ as a function of time of day is given in Figure 5-7. The variance is about two orders of magnitude smaller than the mean $m(t)$. Therefore, the error correction term given in Section 5.2.5 should be very small. Note also that the variance is approximately constant over a one day period. The peak between 9:00 and 10:00 is probably due to the fact that the system is started between those times after daily preventive maintenance. Therefore, $x_t$ can be approximated by a stationary Gaussian process (although the results given in Chapter 4 predict a periodic variance, this periodicity is not noticeable here).
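The Fourier series smoothing of the daily average described in Section 5.2.2 can be sketched as below. The sketch assumes equally spaced samples of $m'(t)$ over exactly one period and approximates the coefficient integrals by sums; the function names and the synthetic profile are hypothetical. The phase is computed with `atan2(a_n, b_n)`, which is the quadrant-aware form of $\arctan(a_n/b_n)$ for a series written in terms of $\sin(n\omega t + \varphi_n)$.

```python
import math

def fourier_fit(samples, period, harmonics):
    """Return (mean, [(c_n, phi_n), ...]) for equally spaced samples over one period."""
    count = len(samples)
    dt = period / count
    omega = 2.0 * math.pi / period
    mean = sum(samples) / count
    terms = []
    for n in range(1, harmonics + 1):
        a_n = (2.0 / period) * dt * sum(
            v * math.cos(n * omega * i * dt) for i, v in enumerate(samples))
        b_n = (2.0 / period) * dt * sum(
            v * math.sin(n * omega * i * dt) for i, v in enumerate(samples))
        terms.append((math.hypot(a_n, b_n), math.atan2(a_n, b_n)))
    return mean, terms

def evaluate(mean, terms, t, period):
    """Evaluate mean + sum_n c_n sin(n*omega*t + phi_n)."""
    omega = 2.0 * math.pi / period
    return mean + sum(
        c * math.sin(n * omega * t + phi) for n, (c, phi) in enumerate(terms, start=1))

# Hypothetical daily profile sampled every five minutes (288 samples per 24 hours).
period_hours = 24.0
profile = [0.2 + 0.05 * math.sin(2.0 * math.pi * (i * period_hours / 288) / period_hours + 0.7)
           for i in range(288)]
fit_mean, fit_terms = fourier_fit(profile, period_hours, harmonics=2)
```

Keeping only a few harmonics acts as a low-pass filter on the day-averaged samples, which is what makes the fitted $m(t)$ smoother than $m'(t)$.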

In summary, the instantaneous fraction of time in kernel mode can be approximated by

$$k_t = m(t) + x_t \tag{5.49}$$

where $m(t)$ is periodic, $x_t$ stationary, and $k_t$ cyclostationary.


Figure 5-5: Fraction of time in kernel mode averaged over a one day period

Figure 5-6: Number of system failures as a function of time of day

Figure 5-7: Variance of $x_t$ averaged over a one day period

Figure 5-8: Estimated and approximated autocorrelation function of the process $x_t$

5.3.1. The cyclostationary model


With the notation developed in Chapter 3, recall that if $\lambda_t$ is the failure rate of a Time Sharing system,

$$\lambda_t = p_u + (p_k - p_u + p_s)\,k_t \tag{5.50}$$

where $k_t$ is the fraction of time in kernel mode averaged in an interval $[t-W/2,\,t+W/2]$ and $p_u$, $p_k$, and $p_s$ are different parameters reflecting the sensitivity of the system to transient faults and software faults. To make the meaning of each parameter easier to remember, $\lambda_t$ will be rewritten as

$$\lambda_t = c_{hw} + (s_{hw} + s_{sw})\,k_t \tag{5.51}$$

$c_{hw}$ is a constant (workload independent) failure rate due to hardware transient faults, $s_{hw}$ is a sensitivity coefficient relating the kernel usage with the (workload dependent) failure rate due to transients, and $s_{sw}$ is an analogous sensitivity coefficient for the failure rate due to software faults.

The autocorrelation function of the process $x_t$ is shown in Figure 5-8, suggesting that an approximation of the form

$$R_{xx}(t) = a_1 e^{-\beta_1|t|} + a_2 e^{-\beta_2|t|} \tag{5.52}$$

is appropriate to describe it. Using the results given in Chapter 3, the PDF of the time to failure conditioned to a starting time $t_s$ is given by

P(t<t,|t3)  =  i-e'<Chwff,'W8)  ^ 


t 


— (1-e 

P 1 


MV 


Pi 


MV, 


(5.53) 


where  the  following  constants  have  been  defined 


s  =  s  +  s. 

sy  sw 


32  -£l 


(5.54) 

(5.55) 


2  2 
a2  ”  \J~ 


and  the  unconditional  PDF  is  given  by 


(5.56) 
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yt,<r)  =»  i  -  9sy(T)e(as/ffsy1'ffsy2)T 


syl 


*1 


•/V 

i- 


sy2 


H2r 
{i-e  2  1 


(5.57) 


where 


asy  =  <Shw  +  m  +  ^hw  *^®hw  +  ®sw’^min^ 

(5.58) 

ffsy1  *  ^Ssw  +  Sh«P  ~jf^ 

(5.59) 

2  a2 

°sy2  ~  ^Ssw  +  ShwP  a  -r  +t 

H2  T  / 

-i  f  -  /  m<s)  da 

9sy(t)  -  — | - j  fn(r)  e  It  dT 

(5.60) 

(5.61) 

5.3.2.  The  stationary  approximation 

Some approximations can be made leading to simpler expressions for the PDF of the time to failure. In particular, the periodicity of the workload will be neglected and the system failure rate will be assumed to be

$$\lambda_t = c + s\,k_t \tag{5.62}$$

where

$$k_t = m + x_t \tag{5.63}$$

and $x_t$ is a stationary Gaussian process with autocorrelation function

$$R_{xx}(t) = \sigma^2\,\eta(|t|) \tag{5.64}$$

If this autocorrelation function is of the form

$$R_{xx}(t) = a_1 e^{-\beta_1|t|} + a_2 e^{-\beta_2|t|} \tag{5.65}$$

using the results of section 3.5.1.2 the following expressions are obtained for the PDF and hazard function of the time to system failure

$$P(t_f<\tau) = 1 - \exp\left\{ -\left(c + s\,m - s^2\frac{a_1}{\beta_1} - s^2\frac{a_2}{\beta_2}\right)\tau - s^2\frac{a_1}{\beta_1^2}\left(1 - e^{-\beta_1\tau}\right) - s^2\frac{a_2}{\beta_2^2}\left(1 - e^{-\beta_2\tau}\right) \right\} \tag{5.66}$$

$$h(t) = c + s\,m - s^2\frac{a_1}{\beta_1}\left(1 - e^{-\beta_1 t}\right) - s^2\frac{a_2}{\beta_2}\left(1 - e^{-\beta_2 t}\right) \tag{5.67}$$

Table 5-1 shows the maximum likelihood values for both the Stationary and Cyclostationary approximations computed from a history of 243 system failures (crashes) from December 1979 to May 1980. After performing a $\chi^2$ goodness-of-fit test between the predicted and observed distribution of failures, both approximations gave levels of confidence larger than 0.05, suggesting the acceptance of both distributions as good characterizations of the PDF of the time to failure.
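The goodness-of-fit comparison reduces to computing a $\chi^2$ statistic from observed and predicted counts and comparing it with a tabulated critical value. The sketch below shows that step only; the function names are hypothetical, the observed and expected counts in the first call are made-up numbers, and the final comparison reuses the statistic and the 14-degree-of-freedom critical value reported in Table 5-1.

```python
def chi_square_statistic(observed, expected):
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of cells")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def fit_is_accepted(statistic, critical_value):
    """Accept the model at the chosen level when the statistic is below the critical value."""
    return statistic < critical_value

toy_statistic = chi_square_statistic([10, 20, 30], [15, 15, 30])  # made-up counts

# Values reported in Table 5-1 for the Stationary model (14 degrees of freedom).
stationary_accepted = fit_is_accepted(15.89, 23.68)
```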


Figure 5-9 shows the hazard function of the equivalent nonhomogeneous Poisson process for both the Cyclostationary and Stationary approximations. The periodic component of the failure rate has been dampened so much that only the exponentially decreasing effect can be observed, and the Cyclostationary and Stationary hazard functions are indistinguishable.

A further approximation can be made. If the autocorrelation function is simplified to a single exponential,

$$R_{xx}(\tau) = a\,e^{-\beta|\tau|} \tag{5.68}$$

then

$$P(t_f<\tau) = 1 - \exp\left\{ -\left(c + s\,m - s^2\frac{a}{\beta}\right)\tau - s^2\frac{a}{\beta^2}\left(1 - e^{-\beta\tau}\right) \right\} \tag{5.69}$$

$$h(t) = c + s\,m - s^2\frac{a}{\beta}\left[1 - e^{-\beta t}\right] \tag{5.70}$$
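The single-exponential case is simple enough to evaluate directly. The sketch below implements the hazard function $h(t) = c + s\,m - s^2 (a/\beta)(1 - e^{-\beta t})$ and the matching survival probability, and can be used to confirm that the hazard decreases from $c + s\,m$ at $t = 0$ to $c + s\,m - s^2 a/\beta$ as $t$ grows. The function names and the parameter values are hypothetical and are not the fitted values of Table 5-1.

```python
import math

def hazard(t, c, s, m, a, beta):
    """Hazard function for a single-exponential autocorrelation a*exp(-beta|t|)."""
    return c + s * m - s * s * (a / beta) * (1.0 - math.exp(-beta * t))

def survival(t, c, s, m, a, beta):
    """Probability of no failure up to elapsed time t, i.e. exp(-integral_0^t h)."""
    drift = (c + s * m - s * s * a / beta) * t
    transient = s * s * (a / beta ** 2) * (1.0 - math.exp(-beta * t))
    return math.exp(-(drift + transient))

# Hypothetical parameter values chosen only to exercise the formulas.
params = dict(c=0.08, s=2.0, m=0.2, a=0.004, beta=0.3)
initial_hazard = hazard(0.0, **params)
limiting_hazard = params["c"] + params["s"] * params["m"] - params["s"] ** 2 * params["a"] / params["beta"]
```

The gap between the initial and limiting values, $s^2 a/\beta$, grows with the correlation time $1/\beta$ of the workload fluctuations, which is the sense in which a more strongly correlated failure rate produces a more pronounced decreasing hazard.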


5.3.3.  A  further  refinement  of  the  cyclostationary  model 


Equation (5.51) implies that, while the system is in kernel mode, the probability of observing a failure due to software on a time interval $\Delta t$ is

$$P_{sw}(\Delta t) = s_{sw}\,\Delta t \tag{5.71}$$

which is a constant independent of the state of the system. This can hardly be a reasonable approximation. If, as described in Chapter 2, software unreliability is mainly due to persistent errors deriving from oversimplifying the complexity of the data to be processed, $s_{sw}$ should not be a constant. In effect, the instantaneous probability of observing a failure due to a software fault should increase when the system is processing data with large complexity and should decrease when processing simple data. Thus, $s_{sw}$ should be a time varying function $s_{sw}(t)$ whose instantaneous value will depend on the average complexity of the data to be processed at time $t$. The problem is therefore how to characterize data complexity since, if $c(t)$ is a suitable descriptor of the complexity of the data at time $t$,

$$s_{sw}(t) = f(c(t))$$

where $f(x)$ is a nondecreasing function of $x$. The most easily measurable descriptor of data complexity is the average time spent in kernel mode, $m(t)$. In effect, a large value of $m(t)$ indicates a highly loaded system, implying therefore a large number of decisions to be taken by the kernel per unit time and continuous updating of its data structures. A small value of $m(t)$ indicates a lightly loaded system, with relatively static and half empty data structures. Note that the instantaneous value of the fraction of time in kernel mode, $k_t$, is not a good descriptor of data complexity because the fact that for a second the kernel has been executed a very short period of time is not meaningful (perhaps a large number of jobs were just waiting for I/O completion).

Figure 5-9: Hazard function of the equivalent nonhomogeneous Poisson process describing the system failure process in both the Cyclostationary and Stationary forms. The two dashed lines indicate the values of the hazard function at zero and infinity.

Table 5-1: Results of applying a $\chi^2$ goodness-of-fit test for the Cyclostationary and Stationary models for system failures (crashes). Both models give levels of confidence larger than 0.05, therefore confirming their validity as accurate system characterization tools.

Model        Parameter values                                                          Degrees of freedom   χ² value   χ²(0.05)   Level of confidence
Cyclostat.   s_sw = 2.23, c_hw = 0.082                                                 18                   15.89      27.869     > 0.05
Stationary   a = 0.08, [illegible]_1 = 0.073, σ = 0.0041, β_1 = 0.28, β_2 = 0.0039     14                   15.89      23.68      > 0.05

$s_{sw}(t)$ will therefore be assumed to be a nondecreasing function of $m(t)$. Again for simplicity a linear relationship will be assumed such that

$$s_{sw}(t) = s_{sw1}\,m(t) + s_{sw2} \tag{5.72}$$

$$\lambda_t = c_{hw} + [s_{sy} + s_{sw}\,m(t)]\,[m(t) + x_t] \tag{5.73}$$

where $s_{hw} + s_{sw2}$ has been denoted $s_{sy}$ and $s_{sw1}$ is written simply as $s_{sw}$. If

$$q(t) = s_{sy} + s_{sw}\,m(t) \tag{5.74}$$

then

$$\lambda_t = c_{hw} + q(t)\,m(t) + q(t)\,x_t \tag{5.75}$$

$$\lambda_t = m'(t,\kappa) + x'_t(\kappa) \tag{5.76}$$

where $\kappa_1 = c_{hw}$, $\kappa_2 = s_{sy}$, $\kappa_3 = s_{sw}$, and $\kappa = (\kappa_1, \kappa_2, \kappa_3)$. Using the results given in Chapter 3, the PDF and hazard function of the time to failure are easily obtained. Just note that


A  COMPATIBLE  HARDWARE/SOFTWARE  RELIABILITY  PREDICTION  MODEL 


$$R_{x'x'}(s,t) = E[x'_s\,x'_t] \tag{5.77}$$

$$= q(s)\,q(t)\,E[x_s\,x_t] \tag{5.78}$$

$$= \sigma'_x(s)\,\sigma'_x(t)\,\eta(|s-t|) \tag{5.79}$$

where

$$\sigma'_x(t) = q(t)\,\sigma_x \tag{5.80}$$

Therefore the PDF of the time to system failure is given by (4.67) evaluated by substituting $m(t)$ by $m'(t)$ and $\sigma_x(t)$ by $\sigma'_x(t)$ in (4.64) and (4.65). Given a set of $n$ observations of system starting and failing times $\{[t_{s_i}, t_{f_i}];\ i = 1,\dots,n\}$, according to Section 5.2.4, the maximum likelihood estimator of $\kappa$ is the value of $\kappa$ which minimizes the function

$$L(\kappa) = \sum_{i=1}^{n} H(t_{s_i}, t_{f_i}, \kappa) - \sum_{i=1}^{n} \ln[h(t_{s_i}, t_{f_i}, \kappa)] \tag{5.81}$$

subject to the constraints

$$h(t_{s_i}, t_{f_i}, \kappa) > 0 \qquad i = 1,\dots,n \tag{5.82}$$

$$\kappa_j > 0 \qquad j = 1,\dots,3 \tag{5.83}$$

The values of $h(t_{s_i}, t_{f_i}, \kappa)$ and $H(t_{s_i}, t_{f_i}, \kappa)$ can be obtained from the results presented in Section 3.6.

$$h(t_{s_i}, t_{f_i}, \kappa) = m'(t_{f_i}) - \sigma'_x(t_{f_i}) \int_{t_{s_i}}^{t_{f_i}} \sigma'_x(\tau)\,\eta(|t_{f_i} - \tau|)\,d\tau \tag{5.84}$$

my,,*)  -  m’(t)dt-  rx(y f)j\  <r’x(t)  Tj(|tf_-t|) dt  +  <x'x(t,)  2’x(y)  i,(t) dt  (5.85) 

S.  3.  3, 


where 


rx(a,b)  =  jf  o’x(t)  dt 
and  m’,  o'  ,  if,  2’  are  functions  of  it. 


(5.86) 


The minimization of $L(\kappa)$ in (5.81) is a well defined nonlinear programming problem. However, the relationships between the variables involved are cumbersome. A simpler method to evaluate a good estimate of $\kappa$ would be helpful.

5.3.4. A computational shortcut


In  Section  3.6.2  it  was  shown  how  for  a  system  under  periodic  workload  the  distribution  of  the 
system  failure  time  (system  start  time)  as  a  function  of  time  of  day  should  approach  the  average 
intensity,  a  linear  function  of  the  average  workload.  Thus,  one  would  expect  that  for  an  observation 
interval  sufficiently  long,  a  histogram  of  the  system  failure  time  as  a  function  of  time  of  day  should 
approach  m’(t).  Assume  that  such  a  histogram  has  been  evaluated  in  C(t) 


$$C(t) = n \quad \text{if the number of failures in } [t/\Delta t,\ t/\Delta t + \Delta t] = n \qquad (5.87)$$

Recall that the system failure rate can be expressed as

$$\lambda_t = f(m(t),\kappa) + x_t(\kappa) \qquad (5.88)$$

Therefore, a possible estimate of $\kappa$ is the value of $\kappa$ which minimizes the norm

$$N(\kappa) = \|C(t), f(m(t),\kappa)\| \qquad (5.89)$$

defined in a suitable functional space. In particular, if the norm chosen is $L^2[0,T]$, the estimate of $\kappa$ will
be that value of $\kappa$ which minimizes the function

$$N(\kappa) = \int_0^T \left[C(t) - f(m(t),\kappa)\right]^2 dt \qquad (5.90)$$

Differentiating with respect to $\kappa_i$, the following system of equations is obtained

$$\int_0^T C(t)\,\frac{\partial f(m(t),\kappa)}{\partial \kappa_i}\,dt = \int_0^T f(m(t),\kappa)\,\frac{\partial f(m(t),\kappa)}{\partial \kappa_i}\,dt, \quad i = 1,\dots,n \qquad (5.91)$$

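where n is the number of components of $\kappa$. In particular, if $f(m(t),\kappa)$ is a polynomial of order n-1 in
m(t),

$$f(m(t),\kappa) = \sum_{i=1}^{n} \kappa_i\,[m(t)]^{i-1} \qquad (5.92)$$

the following system of n equations is obtained

$$X_j = \sum_{i=1}^{n} \kappa_i\,M_{i+j-1}, \quad j = 1,\dots,n \qquad (5.93)$$

where

$$X_j = \int_0^T C(t)\,[m(t)]^{j-1}\,dt, \quad j = 1,\dots,n \qquad (5.94)$$

$$M_i = \int_0^T [m(t)]^{i-1}\,dt, \quad i = 1,\dots,2n-1 \qquad (5.95)$$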
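
The normal equations (5.93) to (5.95) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the integrals are replaced by Riemann sums on a uniform grid of step `dt`, the linear system is solved with a small hand-written Gaussian elimination, and the workload profile and histogram are synthetic. The function names are hypothetical.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x

def fit_polynomial_in_workload(C, m, dt, n):
    """Least-squares kappa for f = sum_i kappa_i * m^(i-1), following (5.93)-(5.95)."""
    # X_j = integral of C(t) m(t)^(j-1) dt, j = 1..n               (5.94)
    X = [sum(c * mv ** (j - 1) for c, mv in zip(C, m)) * dt for j in range(1, n + 1)]
    # M_k = integral of m(t)^(k-1) dt, k = 1..2n-1; M[k-1] holds M_k (5.95)
    M = [sum(mv ** (k - 1) for mv in m) * dt for k in range(1, 2 * n)]
    # Row j of the system: X_j = sum_i kappa_i * M_{i+j-1}          (5.93)
    A = [[M[i + j - 2] for i in range(1, n + 1)] for j in range(1, n + 1)]
    return solve_linear(A, X)

# Synthetic one-day example with hourly samples (all values are made up).
m = [0.3 + 0.2 * math.sin(2 * math.pi * t / 24) for t in range(24)]
C = [1.0 + 2.0 * mv + 3.0 * mv ** 2 for mv in m]
kappa = fit_polynomial_in_workload(C, m, dt=1.0, n=3)
```

Because the synthetic histogram is an exact second-order polynomial in m(t), the fit recovers the coefficients used to build it; with real failure counts the result is the least-squares compromise of (5.90).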
If $f$ is of the form given in (5.73), a system of three equations is obtained with $\kappa_1 = c_{hw}$, $\kappa_2 = s_{sy}$, and $\kappa_3 = s_{sw}$.


Figure  5-10:  Periodic  failure  rate  component  compared  with  a  real 
histogram  of  failures  over  a  one  day  period. 

The values of $c_{hw}$, $s_{sy}$, and $s_{sw}$ were obtained from the histogram of failure data shown in 5-6 and
m(t), the average fraction of time in kernel mode. Figure 5-10 shows the histogram of failures and the
function $f(m(t),\kappa)$

$$f(m(t),\kappa) = c_{hw} + (s_{hw} + s_{sw2})\,m(t) + s_{sw1}\,m^2(t) \qquad (5.96)$$

This  result  will  be  used  in  Chapter  7  to  evaluate  the  contribution  of  software  to  system  unreliability. 


5.4.  Probability  Distribution  Function  of  the  Time  to  Failure  of  a 
File  System 

The  modeling  methodology  presented  in  Chapters  3  and  4  can  be  used  to  characterize  the 
reliability  of  other  systems  or  resources  besides  a  complete  Time  Sharing  system.  As  a  final  example 
(which will also be validated) the PDF of the time to failure of a file system will be evaluated.

For  a  file  system,  the  reasoning  is  that  errors  can  be  detected  only  when  accessing  it.  The 
assumptions  are  that  all  errors  are  hardware  transients  and  that  the  instantaneous  failure  rate  value  is 
$$\lambda_t^{dk} = f(m_{dk}(t), x_t^{dk}, \kappa) \qquad (5.97)$$

where

$$b_t = m_{dk}(t) + x_t^{dk} \qquad (5.98)$$

is the number of blocks accessed in the interval $[t - W/2,\ t + W/2]$. Again, it is assumed that $m_{dk}$ is
periodic, $x_t^{dk}$ stationary, and that

$$\lambda_t^{dk} = c_{dk} + s_{dk}\,[m_{dk}(t) + x_t^{dk}] \qquad (5.99)$$

is cyclostationary.

Figure  5-11  shows  the  results  of  compiling  five  days  of  disk  utilization  samples  into  a  single  24  hour 
period.  Along  with  the  estimated  average,  this  figure  shows  the  function  mdk(t)  obtained  from  a  finite 
Fourier series expansion. After subtracting from $b_t$ the value of $m_{dk}(t)$, the sampled values of the
process $x_t^{dk}$ are available for estimation of its autocorrelation function.
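
The estimation steps just described (folding several days into one 24 hour period, fitting a finite Fourier series, and subtracting it to obtain the residual process) can be sketched as below. This is only an illustration on synthetic data: the sampling rate, the number of harmonics, the noise model, and all function names are assumptions, not values taken from the measurements.

```python
import math
import random

def fold_by_time_of_day(samples, slots_per_day):
    """Average the samples that fall in the same time-of-day slot."""
    sums = [0.0] * slots_per_day
    counts = [0] * slots_per_day
    for k, value in enumerate(samples):
        sums[k % slots_per_day] += value
        counts[k % slots_per_day] += 1
    return [s / c for s, c in zip(sums, counts)]

def fourier_fit(profile, harmonics):
    """Truncated Fourier series of a one-period profile, as a function of the slot index."""
    n = len(profile)
    a0 = sum(profile) / n
    coeffs = []
    for h in range(1, harmonics + 1):
        a = 2.0 / n * sum(p * math.cos(2 * math.pi * h * k / n) for k, p in enumerate(profile))
        b = 2.0 / n * sum(p * math.sin(2 * math.pi * h * k / n) for k, p in enumerate(profile))
        coeffs.append((h, a, b))

    def mean_function(k):
        return a0 + sum(a * math.cos(2 * math.pi * h * k / n) +
                        b * math.sin(2 * math.pi * h * k / n) for h, a, b in coeffs)
    return mean_function

def autocorrelation(x, max_lag):
    """Biased sample autocorrelation for lags 0..max_lag."""
    n = len(x)
    return [sum(x[k] * x[k + lag] for k in range(n - lag)) / n for lag in range(max_lag + 1)]

# Synthetic utilization: five days, 24 samples per day (made-up numbers).
rng = random.Random(0)
slots = 24
samples = [10.0 + 4.0 * math.cos(2 * math.pi * (k % slots) / slots) + rng.gauss(0.0, 1.0)
           for k in range(5 * slots)]

profile = fold_by_time_of_day(samples, slots)          # estimated average per slot
m_dk = fourier_fit(profile, harmonics=2)               # smooth periodic mean
residual = [b - m_dk(k % slots) for k, b in enumerate(samples)]
R = autocorrelation(residual, max_lag=12)              # estimate for the residual process
```

The list `R` is the quantity to which a parametric autocorrelation model would then be fitted.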

The estimated autocorrelation of $x_t^{dk}$ also suggests that an approximation of the form

$$R_x(t) = \sigma_1^2\,e^{-\beta_1 |t|} + \sigma_2^2\,e^{-\beta_2 |t|} \qquad (5.100)$$

would be appropriate to approximate the real autocorrelation function.


adk1  adk2  r.  ^2* 
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He 


where  the  following  constants  and  functions  have  been  defined 
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_  ,  °1  s2 
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(5.104) 
Figure 5-11: Estimated and approximated value of $m_{dk}(t)$


Figure 5-12: Histogram of disk failures as a function of time of day.
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The  hazard  function  is  given  by 


VT>  “  “dk 


dkl 


1  d*dk<T> 

^dk(T>  3t 


(5.106) 


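Model        Parameter values                            Degrees of   $\chi^2$   $\chi^2_{0.05}$   Level of
                                                         freedom      value                        confidence
--------------------------------------------------------------------------------------------------------------
Cyclostat.   $s_{dk} = 14.00$, $c_{dk} = 2.01$           8            8.69       15.507            0.36
Stationary   $\alpha_s = 2.13$, $\sigma_{s1} = 1.42$,    6            8.642      12.592            0.19
             $\sigma_{s2} = 4.03$, $\beta_1 = 0.59$,
             $\beta_2 = 0.21$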

Table 5-2: Results of applying a $\chi^2$ goodness-of-fit test for the
Cyclostationary and Stationary models with the file system failure data. The
hypothesis that the models are good abstractions for the system behavior is
not rejected, since the level of confidence is larger than 0.05 in both cases.

Table 5-2 gives the results of applying a $\chi^2$ goodness-of-fit test to the file system failure data.
Again, although the Cyclostationary model gives a superior level of confidence, the Stationary
approximation also performs very well. Therefore, if great accuracy is not necessary, some of the
complexity involved in the manipulation of the cyclostationary expressions can be saved by
neglecting the periodic component. Figure 5-13 shows the hazard functions for both the
Cyclostationary and Stationary approximations. Note the small range of variability due to the periodic
component of the failure rate.
Figure 5-13: Hazard function of the equivalent nonhomogeneous Poisson
process characterizing the statistics of the time to failure of a file system.
Both hazard functions (according to the Cyclostationary and Stationary
approximations) have been plotted. The Mean Time To Failure is 7 minutes,
which would correspond to a constant hazard function of 0.7 according to the
Exponential model. The two dashed lines at the bottom of the graph enclose
the range of variability of the hazard function due to the periodic component
of the failure rate $m_{dk}(t)$. Note that this range of variation can be neglected
and that the main factor characterizing the hazard function is its decreasing
effect due to the integral of the autocorrelation function $R_{x^{dk}x^{dk}}(\tau)$.


5.5.  Summary 

Both the Cyclostationary and Stationary models have been validated as suitable descriptions of
failure processes in Time Sharing computers. Validation has been performed by applying $\chi^2$
goodness-of-fit tests to the PDF of the time to failure of each model with failure data obtained from a
real system. Two failure processes have been used for this validation: a file system failure process,
and the complete system failure process describing the statistics of the time to crash. The main
conclusions are:

•  Predominance  of  the  decreasing  hazard  function  effect  due  to  the  integrated 
autocorrelation  function  of  the  stochastic  part  of  the  failure  rate. 

•  Marginal  importance  of  the  periodic  component  of  the  failure  rate  with  respect  to 
reliability  prediction. 
•  Exponentially decreasing hazard function since the measured autocorrelation functions
are exponentials.

•  Predominance  of  the  periodic  component  of  the  failure  rate  in  the  PDF  of  the  system 
failure  time  as  a  function  of  time  of  day. 

Obviously, if the decreasing rate of the hazard function is accepted to be exponential and the
periodic component is neglected, it is not necessary to estimate the resource utilization functions.
Instead, the values of $\alpha$, $\sigma_1$, $\sigma_2$, $\beta_1$, and $\beta_2$ can be estimated directly from a history of failures. This is
what was done with the Stationary approximations presented in this Chapter.

The properties of the Cyclostationary and Stationary models are further discussed in the following
Chapter, where these two models are compared (numerically and qualitatively) with the other three
models described in Chapter 2: Exponential, Weibull, and Periodic.
Chapter 6
Discussion


6.1. Reliability modeling

The different models currently used to characterize the reliability of digital computing systems were
summarized in Chapter 2. In this section, the predictions of those models, the predictions of the
Cyclostationary and Stationary models, and the observed behavior of the system described in Chapter
4 are compared. In Section 6.1.1 the predictions of the different models are compared with the observed
system behavior by means of numerical statistical tests. In Section 6.1.2 the assumptions made
by each model are compared, along with some of their most general properties. The main
conclusions of this Chapter are summarized in Section 6.3. The Reliability function and hazard
function of each of the five models (Exponential, Weibull, Periodic, Cyclostationary, and Stationary)
are summarized in Table 6-1.

6.1.1. Numerical comparisons: statistical tests

Table 6-2 shows the results of applying a $\chi^2$ goodness-of-fit test between the actual failure data of
the CMU-10A file system and the distributions predicted by the Exponential, Weibull, Periodic,
Cyclostationary, and Stationary models using appropriate maximum likelihood estimates for each
model. A $\chi^2$ value smaller than $\chi^2_{0.05}$ (i.e., a level of confidence greater than 0.05) indicates a good fit
between predicted and observed behavior and suggests the acceptance of the hypothetical
distribution as the real distribution characterizing the failure process.
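
The acceptance rule can be checked numerically. The sketch below is a minimal illustration, not the procedure used to produce the tables: it computes the usual $\chi^2$ statistic from observed and expected bin counts, and the level of confidence (upper-tail probability) through a closed form that holds only for an even number of degrees of freedom. The function names are hypothetical.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum over bins of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_upper_tail_even(x, dof):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{j=0}^{dof/2 - 1} (x/2)^j / j!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

# Values reported for the Cyclostationary model in Table 5-2.
level = chi_square_upper_tail_even(8.69, 8)   # about 0.37
accepted = level > 0.05
```

With the Cyclostationary figures of Table 5-2 (a $\chi^2$ value of 8.69 with 8 degrees of freedom) the computed level is about 0.37, in line with the 0.36 reported there.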

As  can  be  seen  from  Table  6-2  only  the  Cyclostationary  and  Stationary  models  show  a  clear  good  fit 
with  the  experimental  data.  Neither  the  Exponential  nor  the  Periodic  models  seem  to  be  able  to 
describe  the  failure  process  with  significant  accuracy.  The  Weibull  and  simplified  Stationary  models 
(obtained  by  approximating  the  autocorrelation  function  by  a  single  exponential)  give  levels  of 
confidence  close  to  0.05,  which  suggests  that  these  two  models  can  be  used  when  it  is  desired  to 
trade  some  accuracy  for  model  simplicity. 
Table 6-1: Reliability and Hazard functions of the five compared models.

The Cyclostationary model, taking into account both the periodic workload component and the
integrated  autocorrelation  function,  gives  the  best  description  of  the  failure  process.  Figure  6-1 
shows  the  hazard  functions  of  the  above  five  models  in  the  case  of  file  system  failures. 

Table  6-3  gives  the  results  of  applying  a  x2  test  to  the  five  models  in  the  case  of  system  failures 
(crashes).  Again,  only  the  Cyclostationary  and  Stationary  models  give  levels  of  confidence  larger  than 
0.05.  The  hazard  functions  of  the  five  models  are  shown  in  Figure  6-2. 

remark:  Note  that  the  predominant  effect  is  that  of  having  a  decreasing  hazard  function  due  to  the 
integrated  autocorrelation  function.  Indeed,  neglecting  the  periodic  component  still  leads  to  an 
acceptable  level  of  confidence  for  the  Stationary  model.  On  the  other  hand,  neglecting  the  integrated 
autocorrelation  function  and  taking  into  account  only  the  periodic  workload  component  leads  to  a 
characterization  that  has  to  be  rejected,  as  the  level  of  confidence  of  the  Periodic  model  indicates. 

6.1.2.  Qualitative  comparisons 

As shown in the previous section, the methodology presented in this thesis seems to
lead to a more accurate characterization of system reliability than other more traditional models. Its
widespread use, however, is doubtful due to the complexity of the mathematics involved. Although the
relevance of the results presented in this thesis is discussed in Section 5.3 and later on in Chapter 7,
a comparison of the implicit assumptions and general properties of each model may help to decide
when each model is appropriate.

6.1.2.1. Failure rate

Table  6-4  lists  the  assumptions  made  by  each  model  concerning  the  failure  rate  of  digital 
computing  systems.  The  main  difference  between  the  Cyclostationary  and  Stationary  models  and  the 
three  traditional  models  is  that  traditional  models  assume  the  failure  rate  to  be  a  deterministic 
function  of  time,  while  the  Cyclostationary  and  Stationary  models  assume  the  failure  rate  to  be  a 
stochastic  process. 
Model          Parameter values                            Degrees of   $\chi^2$   $\chi^2_{0.05}$   Level of
                                                           freedom      value                        confidence
----------------------------------------------------------------------------------------------------------------
Exponential    $\lambda_e = 0.67$                          7            130        14.067            0
Weibull        $\lambda_w = 0.91$, $\alpha_w = 0.68$       8            17.717     15.507            0.026
Periodic       $s_p = 1.25$, $c_p = 0.28$                  12           1007       21.026            0
Cyclostat.     $s_{dk} = 14.00$, $c_{dk} = 2.01$           8            8.69       15.507            0.36
Stationary     $\alpha_s = 2.13$, $\sigma_{s1} = 1.42$,    6            8.642      12.592            0.19
               $\sigma_{s2} = 4.03$, $\beta_1 = 0.59$,
               $\beta_2 = 0.21$
Stationary     $\alpha_s = 1.69$, $\sigma_{s1} = 1.38$,    8            19.434     15.507            0.013
(Simplified)   $\beta_1 = 1.38$

Table 6-2: Results of a $\chi^2$ goodness-of-fit test with the Exponential, Weibull,
Periodic, Cyclostationary, and Stationary models for file system failures. Only
the Cyclostationary and Stationary models give levels of confidence greater
than 0.05. The Weibull and simplified Stationary models give smaller levels of
confidence but close to 0.05. The hypothesis that the time to failure can be
characterized with Exponential or Periodic models has to be rejected. The
data used was obtained from five weekdays of system operation during which
877 (transient) failures were detected. The MTTF value is 7 minutes. The file
system is composed of 8 RP06 disk drives totaling 1600 megabytes of on line
storage.

Figure 6-1: Hazard functions predicted by Exponential, Weibull, Periodic,
and Cyclostationary models for file system failures.
Model          Parameter values                                Degrees of   $\chi^2$   $\chi^2_{0.05}$   Level of
                                                               freedom      value                        confidence
--------------------------------------------------------------------------------------------------------------------
Exponential    $\lambda_e = 0.0097$                            18           120        28.869            0
Weibull        $\lambda_w = 0.0137$, $\alpha_w = 0.61$         17           28         27.587            0.045
Periodic       $s_p = 0.01172$, $c_p = 0.0074$                 17           119        27.587            0
Cyclostat.     $s = 2.23$, $c_{hw} = 0.0082$                   18           15.89      28.869            0.6
Stationary     $\alpha_s = 0.082$, $\sigma_{s1} = 0.073$,      14           15.89      23.685            0.3
               $\sigma_{s2} = 0.00413$, $\beta_1 = 0.285$,
               $\beta_2 = 0.0039$
Stationary     $\alpha_s = 0.074$, $\sigma_{s1} = 0.067$,      17           13.61      27.587            0.7
(1 exp.)       $\beta_1 = 0.22$

Table 6-3: Results of a $\chi^2$ goodness-of-fit test with the Exponential, Weibull,
Periodic, Cyclostationary, and Stationary models for system failures
(crashes). Again, the Cyclostationary and Stationary models give the best fit.
The data used was obtained from 6 months of system operation during which
243 crashes due to transients or software were detected (Nov. 1979 to Apr.
1980). The MTTS (Mean Time To restart) value is 9 hours.

Figure 6-2: Hazard functions predicted by Exponential, Weibull, Periodic,
and Cyclostationary models for system failures.
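Model             Failure rate                                            Hazard function
-----------------------------------------------------------------------------------------------------------------------
Exponential       $\lambda$ (constant)                                    $\lambda$ (constant)
Weibull           $\lambda\alpha / (\lambda t)^{1-\alpha}$ (decreasing)   $\lambda\alpha / (\lambda t)^{1-\alpha}$ (decreasing)
Periodic          $m(t)$ (periodic)                                       $m(t)$ (periodic)
Cyclostationary   $m(t) + x_t$ (cyclostationary process)                  $k\,m(t)\,\eta(t)$ (decreasing, modulated by a periodic function)
Stationary        $m + x_t$                                               $\eta(t)$ (decreasing)

Table 6-4: Failure rates and hazard functions assumed by each of the five
models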


6.1.2.2. Hazard function

Loosely  speaking,  the  difference  between  failure  rate  and  hazard  function  is  the  difference  between 
what  actually  happens  and  what  can  be  easily  observed.  The  evaluation  of  the  exact  failure  rate  at  a 
particular  time  in  a  computer  may  be  an  interesting  mathematical  exercise  (that  of  statistical  inference 
of  the  value  of  a  random  variable  from  some  of  its  after  effects,  i.e.,  failures).  But  conceptually, 
reliability  characterization  is  easier  in  terms  of  the  hazard  function. 

This distinction between failure rate and hazard function is not usually made in the Exponential
and Weibull models. Failure rate and hazard function are identified with the same time functions for
those  two  models.  A  hazard  function  can  be  derived  for  the  periodic  model  by  averaging  the  value  of 
the  failure  rate  for  all  possible  system  starting  times  (a  simple  calculation  will  show  that  the  hazard 
function  for  the  periodic  model  is  proportional  to  the  squared  failure  rate). 


Recall that if $h(t)$ is the hazard function, $h(t)\Delta t$ is the probability of observing a failure in the
infinitesimal interval $[t, t + \Delta t]$. Thus, for the exponential model any interval has the same probability of
containing a failure. For the Weibull model, the probability decreases with time. For the periodic
model  this  probability  is  also  periodic.  Both  for  the  cyclostationary  and  stationary  models  this 
probability  is  decreasing.  The  point  is  that  for  the  cyclostationary  and  stationary  models  the  hazard 
function  has  been  obtained  after  computing  the  expectation  for  all  possible  realizations  of  the  failure 
rate. 

Therefore,  it  is  not  maintained  here  that  the  probability  of  observing  a  failure  in  an  infinitesimal 
interval  actually  decreases  with  time.  What  is  maintained  here  is  that  if  the  behavior  of  many  systems 
is  observed,  or  if  the  behavior  of  a  single  system  is  observed  for  a  sufficiently  long  time  interval,  the 
measured  parameters  will  look  as  if  the  infinitesimal  probability  would  decrease  with  time.  But  the 
actual  infinitesimal  probability  for  a  particular  system  at  a  particular  moment  in  time  is  a  random 
variable,  namely,  its  failure  rate  at  that  moment. 
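
The distinction between the failure rate of one system and the hazard function observed over many systems can be illustrated with a much simpler stand-in than the Gaussian failure rate process used in this thesis. In the sketch below, which is an assumption made purely for illustration, each system has a constant failure rate drawn once from two possible values. No individual rate ever changes, yet the hazard function of the population decreases with time, because the systems with the higher rate tend to fail first. The function name and the numbers are hypothetical.

```python
import math

def mixture_hazard(t, rates, weights):
    """Hazard of a population whose members have constant but different failure rates."""
    survival = sum(w * math.exp(-r * t) for r, w in zip(rates, weights))
    density = sum(w * r * math.exp(-r * t) for r, w in zip(rates, weights))
    return density / survival

rates = [0.05, 0.30]     # two possible constant failure rates (made-up values)
weights = [0.5, 0.5]     # probability of each rate
hazards = [mixture_hazard(t, rates, weights) for t in range(0, 50, 5)]
```

At t = 0 the hazard equals the average rate (0.175 here) and it then falls toward the smaller of the two rates.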

6.1.2.3. Reliability Function

Further insight into the implications of using each of the five models can be gained by
comparing  their  Reliability  functions.  Recall  from  Chapter  2  that  the  Reliability  function  is  the 
probability  that  no  failure  will  be  observed  before  time  t.  Only  three  Reliability  functions  will  be 
compared  :  Exponential,  Weibull,  and  Stationary,  given  in  (6.1),  (6.3),  and  (6.9).  Figure  6-3  shows  the 
above  three  reliability  functions  for  the  file  system  failure  data.  Only  these  three  models  are  compared 
to  provide  a  clear  idea  of  their  main  differences  and  similarities.  The  Exponential  model  is  the  most 
widely  used  in  reliability  theory.  The  Stationary  model  gives  a  good  fit  with  experimental  data  while  not 
being  as  complex  as  the  Cyclostationary  model.  And  the  Weibull  model  is  the  closest  previous 
approximation  to  the  methods  presented  in  this  thesis.  Note  from  Figure  6-3  that  for  values  of  t 
smaller  than  14  minutes  (about  twice  the  MTTF  value)  the  Stationary  and  Weibull  models  essentially 
agree  in  their  predictions  while  the  Exponential  model  predicts  reliability  values  larger  than  the  other 
two  models.  For  values  of  t  larger  than  14  minutes,  the  Exponential  model  predicts  reliability  values 
smaller  than  the  predictions  of  the  Stationary  and  Weibull  models,  the  larger  predictions 
corresponding  to  the  Weibull.  Figure  6-4  shows  the  same  three  reliability  functions  for  the  case  of 
system failures. Again, the Exponential model gives reliability predictions up to 20% larger than the
other two models for small values of t, and reliability values that are too small for large values of t. In this case
crossover occurs at t = 13 hours, about 1.5 times the MTTF value.

Figure 6-4: Reliability functions predicted by the Exponential, Weibull, and
Stationary models for system failures (crashes).

If the Stationary model is accepted as the best descriptor of the file system reliability (which is
reasonable after examining the values of the $\chi^2$ test shown in Section 6.1.1) the following
conclusions are reached:

•  For small values of t, the reliability predictions of the Exponential model are
too optimistic. The actual reliability is lower than predicted by the Exponential model.

•  For large values of t, the reliability predictions of the Exponential model are too pessimistic.
The system is actually more reliable than predicted by the Exponential model.

•  The differences in reliability prediction are especially important for values of t smaller than
the Mean Time To Failure, where the Exponential model differs by almost 20% from the
Weibull and Stationary models.

•  Reliability predictions of the Stationary and Weibull models are within 5% of each other over the whole
range of values of time.
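
The optimistic-then-pessimistic behavior of the Exponential model relative to the Weibull model can be reproduced with a few lines. The sketch assumes the Weibull reliability has the form exp(-(λt)^α), which is one common parameterization and may differ from the one used in Table 6-1; the parameter values are those listed for system failures in Table 6-3, and the time unit is whatever unit those fitted rates are expressed in, which is not stated here. The function names are hypothetical.

```python
import math

def reliability_exponential(t, lam):
    """R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def reliability_weibull(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha), an assumed parameterization."""
    return math.exp(-((lam * t) ** alpha))

def crossover_time(lam_e, lam_w, alpha):
    """Time at which both reliabilities coincide (requires alpha != 1).

    Solves lam_e * t = (lam_w * t) ** alpha for t > 0.
    """
    return (lam_w ** alpha / lam_e) ** (1.0 / (1.0 - alpha))

lam_e, lam_w, alpha_w = 0.0097, 0.0137, 0.61   # Table 6-3, system failures
t_star = crossover_time(lam_e, lam_w, alpha_w)

early = (reliability_exponential(t_star / 2, lam_e), reliability_weibull(t_star / 2, lam_w, alpha_w))
late = (reliability_exponential(2 * t_star, lam_e), reliability_weibull(2 * t_star, lam_w, alpha_w))
```

Before `t_star` the Exponential reliability is the larger of the two, and after it the Weibull reliability is larger, which is the qualitative pattern listed above.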


Overall,  the  results  presented  here  are  consistent  with  the  results  presented  in  [McConnel  81].  This 
is  important  because  the  analysis  done  by  [McConnel  81]  with  the  Weibull  distribution  was  extended 
to  redundant  systems  (duplex,  triplex  and  TMR)  for  which  the  same  behavior  was  observed. 


6.2.  A  possible  new  design  parameter 

Assume  now  that  the  autocorrelation  function  of  the  fraction  of  time  in  kernel  mode  is  somehow 
under  the  control  of  system  designers.  Maximum  reliability  would  be  obtained  if 

Rxx<t>‘  =  (%/%) 

That  is,  the  stochastic  component  of  the  system  failure  rate  would  be  white  noise.  In  this  case,  the 
PDF  of  the  time  to  failure  would  become  an  exponential  with  parameter 

\  =  «  •  (aw  (6.12) 

**1  ”2 

The  system  would  still  be  able  to  do  the  same  amount  of  work  in  the  sense  that  the  average  fraction  of 
time  in  kernel  mode  is  independent  of  the  shape  of  its  autocorrelation  function. 

If  such  an  autocorrelation  function  could  be  obtained  on  the  CMU-10A,  the  MTTF  value  would  be 
16  hours,  compared  with  the  real  MTTF  value  of  9  hours  obtained  with  an  exponentially  decreasing 
hazard  function.  Figure  6-5  shows  the  reliability  functions  obtained  from  the  Stationary  approximation 
considering  both  an  exponentially  decreasing  hazard  function  and  a  delta  function  (white  noise). 
Although pure white noise is impossible to obtain physically (it has an infinite bandwidth), the fact that
a faster decay of the autocorrelation function also means a more reliable system is a new
factor to take into account.
Recall from Section 4.1.2.1 that the distinctive property of white noise is that it is unpredictable.
Thus, designing a system with a faster decreasing hazard function also means removing some
predictability from its behavior. Clearly, this indicates a tradeoff between performance and reliability
since many algorithms to enhance system performance (for instance, in scheduling and paging) are
precisely based on predicting future system behavior. However, it is not clear at present how the rate
at  which  the  hazard  function  decreases  (that  is,  the  shape  of  the  failure  rate  autocorrelation  function) 
can  be  controlled. 


6.3.  Summary 

Three different known models used to characterize the reliability of digital computers have been
compared with the two main modeling methods presented in this Thesis (Cyclostationary and
Stationary) and with actual failure data collected on the computing system described in Chapter 5.
Statistical tests performed with two different failure processes clearly suggest:

•  Acceptance  of  the  Cyclostationary  and  Stationary  modeling  methods  as  suitable  tools  to 
characterize  system  reliability. 

•  Rejection of the Exponential and Periodic models as accurate descriptions of computer
failure processes. The only exception may be the use of the Exponential model when
simplicity has absolute priority.

•  Acceptance of the Weibull and simplified Stationary models as marginally accurate
descriptions of system reliability. They are not as good as the Cyclostationary or
Stationary models nor as bad as the Exponential or Periodic.

•  Introduction  of  a  possible  new  design  parameter:  the  rate  at  which  the  hazard  function 
decreases  to  its  asymptotic  value. 


Qualitative comparisons between the Exponential model and the Stationary and Weibull models
have confirmed the findings of [McConnel 81], that is,

•  The  Exponential  model  is  too  optimistic  when  predicting  reliability  for  small  values  of  t. 

•  The  Exponential  model  is  too  pessimistic  when  predicting  reliability  for  large  values  of 
t.  The  Weibull  model  has  been  found  too  optimistic  when  predicting  reliability  for  large  t. 


Clearly, the validity of a modeling methodology cannot be confirmed or denied by the results of a
single experiment. However, the results obtained so far are encouraging and justify a more detailed
study. Therefore, the next Chapter is dedicated to elaborating some applications derived from the
Cyclostationary and Stationary modeling methods. In some cases, these applications will be
independent of the results obtained until now. However, they are included because they are natural
extensions of the philosophy used throughout the Thesis.

Figure 6-5: Reliability functions obtained from the Stationary model by
considering the real autocorrelation function and white noise. (Horizontal axis: t, in hours.)
Chapter  7 
Applications 

The previous Chapters have shown how the Cyclostationary and Stationary modeling methods are
appropriate for characterizing the reliability of computing systems operating under periodic or
constant workload, respectively. Further, it has been shown how the Stationary model may be
enough to predict reliability even for systems under periodic workload, since the effect of a
periodic workload on the failure rate is minor compared with the effect of considering the failure rate
to be a Gaussian process, and therefore having a decreasing hazard function.

Nevertheless, as will be shown in Sections 7.1 through 7.3, workload periodicity can still be
used  to  obtain  some  new  results  related  to  reliability  characterization.  The  first  contribution  is 
presented  in  Section  7.1,  where  it  is  shown  how  the  contributions  of  software  and  hardware  errors 
can  be  easily  evaluated. 

It was stated in Chapter 2 that one of the main problems associated with the acceptance of
fault-tolerance as a more desirable attribute of general purpose computing systems was the fact that
performance evaluation and reliability characterization are unconnected. Thus, in Section 7.2 an
attempt is made to elaborate an integrated Performance/Reliability model.

In  Section  7.3  the  problem  of  determining  the  optimum  checkpointing  interval  in  a  transaction 
processing  system  is  revisited  and  refined.  The  purpose  of  this  section  is  to  determine  if  the  modeling 
methods  presented  in  this  thesis  in  any  way  invalidate  or  confirm  previously  obtained  results. 

Finally, in Section 7.4 a first step is taken in a completely new area: modeling the effects of
hardware transients, software faults, and permanent hardware faults. The main conclusions of the
Chapter are summarized in Section 7.5.
7.1. The impact of unreliable software on the observed system
reliability

As described in Chapter 2, current software reliability modeling and measurement efforts
concentrate on the evaluation of static software attributes. Some of these attributes are the number of
bugs present in a software package, or the mean time between software failures of a set of programs
operating in a controlled environment. Here, however, the evaluation will refer to the observed
behavior of systems operating in the field under dynamically changing conditions. Further, software
reliability models usually refer to parameters of interest to the software development team, while here
an effort will be made to quantify the impact of software unreliability on the average user of a Time
Sharing system and on the user community.

Perhaps the simplest question to be asked is whether a given system failure is due to a hardware
transient or to a software fault. Most operating systems provide some tools to help answer such a
question. The most primitive tool is just a memory dump that has to be manually analyzed to resolve
the  cause  of  the  failure.  Other  systems  provide  more  information  in  an  error  log.  And  some  systems 
even  attempt  to  automatically  classify  all  failures.  However,  some  experience  using  such  tools  soon 
teaches  the  difficulty  of  the  problem.  Except  for  a  few  clear  hardware  failures  (a  hard  memory  parity 
error  while  accessing  one  of  the  kernel  data  structures)  most  failures  usually  remain  unresolved. 
Assume  that  a  system  is  hung  in  an  infinite  loop  in  the  kernel  and  the  system  has  to  be  manually 
crashed  by  the  operator.  How  can  it  be  known  if  a  part  of  code  was  overwritten  by  the  software  itself 
or  if  an  undetected  transient  altered  the  destination  address  of  a  jump  instruction? 

The  method  proposed  here  to  resolve  such  ambiguities  is  probabilistic.  Although  each  system 
failure  is  due  to  a  particular  cause,  to  learn  the  exact  cause  for  each  failure  with  a  reasonable  level  of 
confidence  may  be  extremely  costly.  The  method  proposed  here  will  give  only  expectations  and 
averages.  But  it  is  substantially  cheaper. 

It was shown in Section 5.3.3 how the instantaneous system failure rate at each moment in time is
given by

    \lambda_t^{sys} = c_{hw} + s_{sy} \lambda_t + s_{sw1} m(t) \lambda_t    (7.1)

which can be viewed as the superposition of the hardware and software failure rates.

    \lambda_t^{sw} = (s_{sw1} m(t) + s_{sw2}) \lambda_t    (7.2)

    \lambda_t^{hw} = c_{hw} + s_{hw} \lambda_t    (7.3)

Since \lambda_t = m(t) + x_t, it is convenient at this point to indicate the dependency on x_t more explicitly

    \lambda_t^{sw}(x_t) = (s_{sw1} m(t) + s_{sw2}) (m(t) + x_t)    (7.4)

    \lambda_t^{hw}(x_t) = c_{hw} + s_{hw} (m(t) + x_t)    (7.5)

such that the system failure rate is given by

    \lambda_t^{sys}(x_t) = \lambda_t^{sw}(x_t) + \lambda_t^{hw}(x_t)    (7.6)

The system failure process can therefore be viewed as a marked doubly stochastic Poisson process,
each failure being associated with a mark specifying whether it is hardware or software related. Given
that a failure has occurred at time t_f, the probability that this failure is due to software is

    P_{sw}(t_f) = E\{ \lambda_{t_f}^{sw}(x_{t_f}) / [ \lambda_{t_f}^{sw}(x_{t_f}) + \lambda_{t_f}^{hw}(x_{t_f}) ] \}    (7.7)

where the expectation is taken with respect to the statistics of x_{t_f}, and

    P_{hw}(t_f) = 1 - P_{sw}(t_f)    (7.8)

Hence,

    P_{sw}(t_f) = \frac{1}{(2\pi)^{1/2} \sigma_x} \int_{-m(t_f)}^{\infty} \frac{(s_{sw1} m(t_f) + s_{sw2}) [m(t_f) + u]}{c_{hw} + [s_{hw} + s_{sw1} m(t_f) + s_{sw2}] [m(t_f) + u]} e^{-u^2 / 2\sigma_x^2} du    (7.9)

where the restriction of having a strictly positive failure rate has been taken care of in the lower limit of
the integral.
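
As an illustration only, the expectation above can be approximated numerically. The following Python sketch assumes that x_t is Gaussian with zero mean and standard deviation sigma, that the software rate is (s_sw1 m + s_sw2)(m + u) and the hardware rate is c_hw + s_hw (m + u), and that the integration starts at u = -m so that m + u stays positive. The function name, the midpoint rule, and all numerical values are assumptions made for this example; they are not the CMU-10A estimates.

```python
import math


def p_software(m, sigma, s_sw1, s_sw2, s_hw, c_hw, steps=20000):
    """Approximate E[sw_rate / (sw_rate + hw_rate)] over a Gaussian x_t,
    restricted to m + x_t > 0, using the midpoint rule."""
    lo = -m                                   # lower limit keeps m + u positive
    hi = max(8.0 * sigma, lo + 8.0 * sigma)   # Gaussian tail is negligible beyond this
    h = (hi - lo) / steps
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for i in range(steps):
        u = lo + (i + 0.5) * h
        load = m + u
        sw = (s_sw1 * m + s_sw2) * load       # software failure rate
        hw = c_hw + s_hw * load               # hardware failure rate
        total += sw / (sw + hw) * norm * math.exp(-u * u / (2.0 * sigma * sigma))
    return total * h


# Hypothetical coefficients, for illustration only.
p = p_software(m=1.0, sigma=0.2, s_sw1=0.5, s_sw2=0.3, s_hw=0.2, c_hw=0.1)
print(f"P_sw = {p:.3f}, P_hw = {1.0 - p:.3f}")
```

Evaluating the function at the mean workload m(t) of each hour of the day would give a curve of the kind plotted in Figure 7-1.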


Figure 7-1 shows the probability that a crash is due to a software error as a function of the time of
day for the CMU-10A after computing the maximum likelihood values of the coefficients according to
Section 5.3.4. Since the linear term in the failure rate, s_{sy}, cannot be separated into its software and
hardware components (s_{sw2} and s_{hw}), Figure 7-1 shows the upper and lower bounds obtained by
assuming s_{sw2} = s_{sy} and s_{sw2} = 0. On the average, it seems that software accounts for 60% of the
crashes while the remaining 40% is due to hardware. This is a misleading interpretation because the
system does not crash at any time with equal probability.

Figure 7-1: Probability that a crash is due to software or hardware as a
function of the time of day

Consider a set of failures observed at times t_{f_1}, ..., t_{f_N}. The expected number of failures due
to software is

    E[n_{sw} | N; t_{f_1}, ..., t_{f_N}] = \sum_{k=1}^{N} P_{sw}(t_{f_k})    (7.10)


This  expectation  has  been  computed  for  the  set  of  243  crashes  observed  in  six  months  of  operation 
of  the  CMU-10A.  Since  the  system  crashes  more  often  at  the  times  that  the  contribution  of  unreliable 
software  is  larger,  67%  of  the  crashes  are  due  to  software.  But  it  is  still  possible  to  refine  this  number. 
The impact of each crash depends on the number of jobs being executed at the time of the crash. Figure
7-2 shows the average number of jobs executing in the CMU-10A as a function of time of day. Given
that a crash occurs at time t_f, the expected number of jobs crashed due to software, E[J_{sw}], is

    E[J_{sw}] = P_{sw}(t_f) E[J_{t_f}]    (7.11)

where E[J_{t_f}] is the expected number of jobs executing at time t_f. Given a set of N failures at times
t_{f_1}, ..., t_{f_N}, the expected number of jobs aborted due to software is

    E[J_{sw} | N; t_{f_1}, ..., t_{f_N}] = \sum_{k=1}^{N} P_{sw}(t_{f_k}) E[J_{t_{f_k}}]    (7.12)


The value obtained for the CMU-10A is that 70% of the jobs aborted in system crashes do so because
of software errors, a percentage substantially higher than the 60% originally computed for the
probability that a single crash is due to software. These results are summarized in Table 7-1.
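
The two refinements described above (weighting by when crashes actually occur, and then by how many jobs are running) can be sketched in a few lines of Python. The function software_shares, the step-shaped P_sw and job curves, and the list of crash times are all hypothetical and chosen only to make the arithmetic easy to follow; they do not reproduce the CMU-10A measurements.

```python
def software_shares(failure_times, p_sw, mean_jobs):
    """Return (expected fraction of crashes due to software,
    expected fraction of aborted jobs due to software)."""
    n = len(failure_times)
    crashes_sw = sum(p_sw(t) for t in failure_times)
    jobs_sw = sum(p_sw(t) * mean_jobs(t) for t in failure_times)
    jobs_all = sum(mean_jobs(t) for t in failure_times)
    return crashes_sw / n, jobs_sw / jobs_all


# Hypothetical daily profiles: busier and more software-dominated by day.
def p_sw(hour):
    return 0.75 if 9 <= hour < 18 else 0.45


def mean_jobs(hour):
    return 40.0 if 9 <= hour < 18 else 10.0


crash_hours = [10, 11, 14, 15, 3, 22]          # hypothetical crash times
day_average = sum(p_sw(h) for h in range(24)) / 24.0
crash_share, job_share = software_shares(crash_hours, p_sw, mean_jobs)
print(day_average, crash_share, job_share)
```

With these toy inputs the three numbers increase in the same order as the corresponding entries of Table 7-1, because crashes cluster in the hours where both the software share and the number of active jobs are high.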


Figure 7-2: Average number of jobs executing as a function of time of day (vertical axis: number of jobs)


Range of variability of the probability that a crash is due        45%-75%
to software, depending on the time of day at which the
crash occurs

Probability that a crash is due to software, averaged over         60%
a one day period

Expected percentage of crashes due to software during              67%
6 months of operation of the CMU-10A

Expected percentage of jobs aborted due to software during         70%
6 months of operation of the CMU-10A

Table 7-1: Different views of the impact of software in system unreliability
7.2. Performance/Reliability evaluation


7.2.1. The user’s viewpoint

From  a  user’s  viewpoint,  working  with  an  unreliable  system  has  an  added  cost  that  would  not  be 
present  in  a  failure  free  system.  This  added  cost  due  to  unreliability  is  mainly  due  to  two  factors: 

• A possible delay in finishing the user’s task. The system may fail, remain unavailable for a
while, and parts of the programs being executed may have to be repeated afterwards.
The expected time required to complete a task is therefore longer in an unreliable system
than in a failure free system.

• The cost associated with repeated computations, that is, the cost associated with the
use of resources that may effectively be wasted, since the system may fail and some
computations may have to be repeated.

These costs will be quantified for a CMU-10A user in this section. The approach is essentially the
same as in [Castillo 80a]. The problem of evaluating the added cost due to unreliability is
visualized in Figure 7-3.


Figure 7-3: Typical system of events illustrating the unreliable behavior of a
computing system from a user viewpoint (timeline labels: Start, Restart, Restart, Restart,
End of execution, T)

A program is started at time t_0 and failures occur at times t_1, t_2, ..., such that after each failure the
program has to be restarted. Complete execution terminates after the system has been operating
continuously for a time T_{real}.

The total elapsed time from the moment the user starts the program until the program correctly completes
execution, T, is equal to T_{real} plus T_{rec}, the time during which the program was executing but which was
wasted because the system failed before the program finished execution.


T_{real} is a random variable whose statistics depend on the resources needed by the program to
complete execution (CPU time, storage requirements, etc.) and on the system workload during
program execution (i.e., the rate at which these resources are provided by the operating system,
depending on competing requests by other users). T_{rec} is another random variable whose statistics
depend on T_{real} and on the statistics of the time to failure. The total expected cost (in terms of
time) incurred in executing the program is

    E[C_T] = E[T_{rec}] + E[T_{real}]    (7.13)


where the first term in (7.13) is the added cost due to unreliability, in the sense that it would
be zero if the program were executed in a failure free system. The failure process will be assumed to
be stationary and the average workload will be assumed to be constant. The expected cost is then
given by

    E[C_T] = E\{ E[T_{rec} | T_{real}] \} + E[T_{real}]    (7.14)

Given T_{real}, the expected value of T_{rec} is

    E[T_{rec} | T_{real} = x] = \sum_{n=0}^{\infty} P(N_f = n | T_{real} = x) E[T_{rec} | T_{real} = x; N_f = n]    (7.15)

P(N_f = n | T_{real} = x) is the probability that the program is restarted n times given that it requires x units of
time of continuous system operation, and is given by

    P(N_f = n | T_{real} = x) = [P_f(\tau < x)]^n P_f(\tau > x)    (7.16)

If t_f is the time from restart to failure,

    E[T_{rec} | T_{real} = x; N_f = n] = n E[t_f | T_{real} = x]    (7.17)

Substituting now (7.17) and (7.16) in (7.15),

    E[T_{rec} | T_{real} = x] = E[N_f | T_{real} = x] E[t_f | T_{real} = x]    (7.18)

That is, the expected value of T_{rec} is equal to the expected number of failures multiplied by the
expected time from restart to failure, given that T_{real} = x. The expected number of failures is

    E[N_f | T_{real} = x] = \sum_{n=0}^{\infty} n [P_f(\tau < x)]^n P_f(\tau > x)    (7.19)

                          = P_f(\tau < x) / P_f(\tau > x)    (7.20)

The distribution of the time from restart to failure given that a failure occurs before x units of time is
equal to the distribution of the time to failure truncated at time \tau = x, that is,

    p_{t_f | t_f < x}(t) = p_{t_f}(t) [U(t) - U(t - x)] / P_f(\tau < x)    (7.21)

where p_{t_f}(t) is the pdf of the time to failure and U(t) is the step function

    U(t) = 1 if t > 0, 0 otherwise    (7.22)

Therefore,

    E[t_f | T_{real} = x] = \int_0^{\infty} t p_{t_f | t_f < x}(t) dt    (7.23)

                          = \frac{1}{P_f(\tau < x)} [ E[t_f] - E[t'_f(x)] ]    (7.24)

where

    E[t'_f(x)] = \int_x^{\infty} t p_{t_f}(t) dt    (7.25)

Substituting now (7.24) and (7.20) in (7.18) the following result is obtained

    E[T_{rec}] = E[t_f] \int_0^{\infty} \frac{p_{T_{real}}(x)}{P_f(\tau > x)} dx - \int_0^{\infty} \frac{p_{T_{real}}(x)}{P_f(\tau > x)} E[t'_f(x)] dx    (7.26)
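
For a given time-to-failure distribution the conditional expectation E[T_rec | T_real = x] can be evaluated directly: the expected number of restarts, P(tau < x)/P(tau > x), multiplied by the mean of the time to failure truncated at x, reduces to the integral of t p(t) from 0 to x divided by P(tau > x). The Python sketch below is a minimal numerical version under that reading; the function names and the 120 minute MTTF are hypothetical. For an exponential time to failure with rate lam the same quantity has the closed form (exp(lam x) - 1)/lam - x, which the sketch uses as a check.

```python
import math


def expected_rework(x, pdf, survival, steps=20000):
    """E[T_rec | T_real = x]: integral_0^x t*pdf(t) dt divided by P(tau > x),
    with the integral evaluated by the midpoint rule."""
    h = x / steps
    partial_mean = sum((i + 0.5) * h * pdf((i + 0.5) * h) for i in range(steps)) * h
    return partial_mean / survival(x)


def exponential_rework(x, lam):
    """Closed form of the same quantity for an exponential time to failure."""
    return (math.exp(lam * x) - 1.0) / lam - x


lam = 1.0 / 120.0                                  # hypothetical MTTF of 120 minutes
pdf = lambda t: lam * math.exp(-lam * t)
survival = lambda t: math.exp(-lam * t)

for x in (10.0, 30.0, 60.0):                       # uninterrupted minutes required
    print(x, expected_rework(x, pdf, survival), exponential_rework(x, lam))
```

Passing a decreasing-hazard density (for example a Weibull pdf with shape below one) in place of the exponential only requires changing the two callables.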


Figure  7-4  shows  the  expected  elapsed  time  required  to  execute  a  program  at  three  different  times 
of  day  for  different  values  of  T^  .  For  each  curve,  the  straight  line  represents  the  second  term  in 
equation  (7.13),  that  is,  it  is  the  expected  elapsed  time  due  to  workload  only.  The  solid  line  represents 
the total expected elapsed time. At 12:00, the contribution of unreliability to the expected elapsed time
of a program requiring 30 minutes of CPU is 30%. The curves were obtained by actually measuring
the distribution of the elapsed time required to execute a CPU bound program at the three times of
day in the absence of errors. The mean time to failure at each time of day was measured by counting
the number of crashes that occurred in two hour time slots centered at each of the three times
considered.


7.2.2. The manager’s viewpoint

In  the  previous  section  it  has  been  shown  how  the  added  cost  due  to  unreliability  can  be  evaluated 
for  a  user  trying  to  complete  a  task  given  the  resources  needed  by  his  task  and  the  system  workload 
patterns.  Here,  a  measure  will  be  developed  of  potential  use  to  system  administrators.  The  idea  is  to 
evaluate  the  cost  due  to  unreliability  of  a  computer  system  operating  as  a  server  of  computing  utility. 


Let J_t be the number of active jobs at time t. For fixed t, J_t will be, in general, an integer valued
random variable. {J_t} is therefore an integer valued stochastic process. Assume that a system failure
occurs at time t_{f_k}. The added cost due to that failure can be evaluated as

    C(t_{f_k}) = J_{t_{f_k}} C_d + \sum_{i=1}^{J_{t_{f_k}}} C_{r_i}    (7.27)

where J_{t_{f_k}} C_d is the cost associated with the time that the system is down (which will be assumed to be
fixed for failures due to transients and software) and C_{r_i} is the cost associated with the recovery of the
i-th job. Assume now that the system has been operating for N_d days, during which N_f failures have
occurred at times t_{f_1}, ..., t_{f_{N_f}}. The expected added cost due to unreliability during these N_d days is

    C(N_d | N_f; t_{f_1}, ..., t_{f_{N_f}}) = E\{ \sum_{k=1}^{N_f} J_{t_{f_k}} C_d \} + E\{ \sum_{k=1}^{N_f} \sum_{i=1}^{J_{t_{f_k}}} C_{r_i} \}    (7.28)

If the recovery cost for any job is independent of the number of jobs active at the time of failure, and
the C_{r_i} are assumed to be identically distributed random variables,

    C(N_d | N_f; t_{f_1}, ..., t_{f_{N_f}}) = \sum_{k=1}^{N_f} E\{J_{t_{f_k}}\} C_d + \sum_{k=1}^{N_f} E\{J_{t_{f_k}}\} E\{C_{r_i}\}    (7.29)

According to the results presented in Chapter 3, given that N_f failures have occurred, each of the t_{f_k}
has a distribution over a one day period equal to the periodic component of the failure rate, f(m(t), \kappa).
Thus

    C(N_d | N_f) = N_f [ C_d + E\{C_{r_i}\} ] \int_0^T f(m(t), \kappa) E\{J_t\} dt    (7.30)

where it has been assumed that a stationary distribution exists for the C_{r_i}. Finally, since neither C_{r_i},
m(t), nor J_t depends on N_f,

    C(N_d) = E\{N_f\} [ C_d + E\{C_{r_i}\} ] \int_0^T f(m(t), \kappa) E\{J_t\} dt    (7.31)


For  a  system  administrator,  the  interesting  question  is  whether  the  policies  regulating  the  use  of  the 
system  can  be  modified  such  that  the  above  cost  is  minimum,  while  simultaneously  executing,  on  the 
average,  the  same  number  of  jobs  per  unit  time. 
E{N_f} is the expected number of failures in N_d days and can be reduced only by improving the
hardware or the operating system. C_d is the cost associated with the system down time after a failure,
which presumably will already be as small as possible. E{C_{r_i}} is the cost associated with the abortion
of user jobs, and depends on users' patterns of use, programming style, and so on. The only term left
for the administrator to play with is the variable cost associated with the system workload variations.
Let C_u be equal to the added cost depending on workload variations

    C_u = \int_0^T f(m(t), \kappa) E[J_t] dt    (7.32)

Since f(m(t), \kappa) is a polynomial in m(t), C_u will be minimum when m(t) = \bar{m}, the mean value of m(t).
Thus, allowing the workload to vary around its mean value has an associated cost in itself. For the
CMU-10A the periodic variations actually increase the cost due to unreliability by 8%, in the sense that
C_u = 0.13 and, for constant workload, C_u^{min} = 0.12. Obviously, the number of jobs processed in one day
is the same in both cases.
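
The effect of workload variation on C_u can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the periodic component f is proportional to m(t) (a first degree polynomial, normalized to integrate to one over the day) and that E[J_t] equals m(t); neither assumption is taken from the CMU-10A data, and the function name workload_cost is hypothetical. Under these assumptions a sinusoidal workload with relative amplitude a costs a factor 1 + a^2/2 more than a constant workload with the same mean.

```python
import math


def workload_cost(m, jobs, period=24.0, steps=2400):
    """C_u = integral over one period of f(t) * E[J_t] dt, with f(t) taken
    proportional to m(t) and normalized to integrate to one (an assumption)."""
    h = period / steps
    ts = [(i + 0.5) * h for i in range(steps)]
    norm = sum(m(t) for t in ts) * h              # makes f integrate to one
    return sum(m(t) / norm * jobs(t) for t in ts) * h


mean_load = 10.0                                   # hypothetical mean workload
amplitude = 0.5                                    # hypothetical relative swing

periodic = lambda t: mean_load * (1.0 + amplitude * math.sin(2.0 * math.pi * t / 24.0))
constant = lambda t: mean_load

c_periodic = workload_cost(periodic, periodic)
c_constant = workload_cost(constant, constant)
print(c_periodic / c_constant)                     # relative penalty of the daily swing
```

Both workloads process the same number of jobs per day, since their daily integrals are equal; only the distribution of the load over the day differs.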

Dividing (7.31) by N_d, the added cost due to unreliability per unit time is obtained

    C = \frac{1}{MTTF} [ C_d + E\{C_{r_i}\} ] C_u    (7.33)

This expression is important as it includes all factors for which unreliability has an associated added
cost. From (7.33) it is seen that doubling the MTTF value cuts the added cost in half. It
has already been shown how C_u increases this cost as a consequence of workload periodicity.
Finally, C_d is usually going to be small compared with E{C_{r_i}}, the recovery cost associated with each
job. Thus, one way to reduce the added cost due to unreliability is to reduce the expected recovery
cost from failures. The next section shows how the recovery costs can be reduced by introducing
checkpointing.


7.3.  On  the  optimum  checkpointing  interval 

To diminish the added cost due to unreliability several alternatives are possible according to
expression (7.31). Assuming that hardware and operating system reliability are given and that the
workload patterns cannot be changed, there is still a way by which the cost associated with delays and
repeated computations can be reduced. Assume that at certain points in time called checkpoints a
copy of the program memory image and data structures is made and stored in some secondary
storage medium. Figure 7-5 shows a typical sequence of events when checkpointing is possible. If a
failure occurs before the program completes execution, the copy of the program image at the most
recent checkpoint is restarted. Thus, only the computations performed since the last checkpoint have
to be repeated. Since the checkpoint operation also has an associated cost, the problem is to
estimate the time between checkpoints such that the overall added cost (cost due to checkpoints and
cost due to failures) is minimized.

Figure 7-5: Typical sequence of events in a system with checkpointing
facilities. The total added cost due to unreliability is the cost associated with
the checkpoint operation, plus the cost due to system unavailability due to
failures, plus the cost of recovering after each failure to the state given by the
last checkpoint. (Timeline labels: Checkpoint, Crash, Recovery, Time.)

Checkpointing is rarely used in Time Sharing systems except in programs where loss of data due to
a failure is especially inconvenient (such as editors or electronic mail programs). However, it is
extensively used in Real Time systems and in transaction processing systems, where at each
checkpoint a copy of the database is made, and recovering from a failure means bringing the
database to the last consistent state and reprocessing the transactions that arrived since the last checkpoint.

Because of its importance, the problem of determining the optimum checkpointing interval has
received considerable attention. Table 7-2 is a summary of the most relevant models proposed for the
evaluation of the optimum checkpointing interval. For each model the table gives a reference, the main
assumptions of the model, and the decision criterion used to determine the optimum checkpointing
interval. Most of these models have been surveyed in [Chandy 75b].

The purpose of this section is to investigate whether the modeling methodology presented in this thesis
confirms or invalidates the results given by the models presented in Table 7-2, and to study whether a
refinement of these results is possible.




Reference       Assumptions                              Goal

[Young 74]      - Constant workload                      Maximize availability
                - Constant failure rate
                - No errors during checkpointing

[Chandy 75a]    - Constant workload                      Maximize availability
                - Constant failure rate
                - Errors occur during checkpointing

[Chandy 75b]    - Periodic workload                      Maximize number of
                - Periodic failure rate                  transactions processed

[Gelenbe 78]    - Constant workload                      Minimize response time
                - Constant failure rate
                - Errors occur during checkpointing

Table 7-2: Four proposed models to evaluate the optimum checkpointing
interval in a transaction processing system


7.3.1. Constant workload

If the workload is constant the failure process becomes a renewal process. The times between
successive failures form a sequence of independent identically distributed random variables.
Following the same approach as in Section 7.2, the added cost due to unreliability per unit time will be
evaluated. If the checkpointing interval is assumed to be T_{ck}, the added cost is given by

    E[C_{T_{ck}}] = \frac{1}{T_{ck}} [ C_d + E[C_R | T_{ck}] ]    (7.34)

E[C_R | T_{ck}] is the expected cost due to recoveries from possible failures given that the checkpoint
interval is T_{ck}. Repeating the same reasoning as in Section 7.2,

    E[C_R | T_{ck}] = \sum_{n=0}^{\infty} P_{T_{ck}}(N_f = n) E[C_R | T_{ck}, N_f = n]    (7.35)

                    = E[N_f | T_{ck}] [ C_{R_0} + k T_{ck}/2 ]    (7.36)

where it has been assumed that the recovery cost after each failure is equal to a fixed cost C_{R_0} plus a
variable cost proportional to the time since the last checkpoint. The expected variable cost is k T_{ck}/2.
Also, for a renewal process, the expected number of failures during the time T_{ck} is T_{ck} divided by the
Mean Time To Failure (MTTF). Hence,

    E[N_f | T_{ck}] = T_{ck} / MTTF    (7.37)

and

    E[C_{T_{ck}}] = \frac{1}{T_{ck}} [ C_d + \frac{T_{ck}}{MTTF} ( C_{R_0} + k T_{ck}/2 ) ]    (7.38)

The expected cost will be minimum when its derivative with respect to T_{ck} is zero. The optimum
checkpointing interval is therefore given by

    T_{ck} = \sqrt{ 2 C_d MTTF / k }    (7.39)

which is exactly the result obtained by [Young 74].


Indeed,  since  the  failure  process  is  a  renewal  process,  the  expected  cost  depends  on  the  expected 
number  of  renewals  (failures)  per  unit  time,  but  it  is  independent  of  the  PDF  of  the  time  to  failure. 
Therefore  any  result  obtained  under  the  assumptions  of  this  thesis  will  agree  with  previously  obtained 
results  if  the  average  workload  is  assumed  to  be  constant. 
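
The constant workload result can be checked numerically. The Python sketch below takes the added cost per unit time to be the checkpoint cost C_d plus the expected recovery cost (T_ck / MTTF)(C_R0 + k T_ck / 2), all divided by T_ck, and compares the closed form square root of 2 C_d MTTF / k with neighbouring intervals. The function names and the numerical values are hypothetical and serve only as an example.

```python
import math


def added_cost(t_ck, c_d, c_r0, k, mttf):
    """Expected added cost per unit time for checkpoint interval t_ck."""
    expected_failures = t_ck / mttf                 # renewal process assumption
    recovery = expected_failures * (c_r0 + k * t_ck / 2.0)
    return (c_d + recovery) / t_ck


def optimum_interval(c_d, k, mttf):
    """Interval at which the derivative of added_cost with respect to t_ck is zero."""
    return math.sqrt(2.0 * c_d * mttf / k)


# Hypothetical values in minutes: 0.5 min per checkpoint, MTTF of 10 hours.
c_d, c_r0, k, mttf = 0.5, 2.0, 1.0, 600.0
t_opt = optimum_interval(c_d, k, mttf)
print(t_opt, added_cost(t_opt, c_d, c_r0, k, mttf))
```

Note that the fixed recovery cost C_R0 shifts the cost curve but does not move its minimum, which is why it does not appear in the optimum interval.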


7.3.2.  Periodic  workload 

If the average system workload is given by w(t) and the average failure rate is \lambda(t), [Chandy 75a] has
given a recursive algorithm to determine the optimum sequence of checkpoint times to minimize the
added cost due to unreliability. The solution given by [Chandy 75a] is based on discretizing m(t) and
\lambda(t) in intervals during which they can be assumed to be constant. Graph theory can then be used to
determine the optimum collection of checkpoint times.


The problem was originally stated by [Chandy 75a] as follows. If the last checkpoint was performed
at time T_{ck}, the cost due to a possible failure at time t is

    C_R^{(T_{ck},t]} = C_{R_0} + k \int_{T_{ck}}^{t} w(s) ds    (7.40)

If the time required to perform a checkpoint is C_d, the total expected cost in [T_{ck}, t] is

    E[C^{[T_{ck},t]}] = \int_{T_{ck}}^{T_{ck}+C_d} w(\tau) d\tau + \int_{T_{ck}+C_d}^{t} C_R^{(T_{ck},\tau]} \lambda(\tau) d\tau    (7.41)

Although according to the results presented in Chapter 4 \lambda_t = m(t) + x_t, the expected cost is equal to
(7.41) since E[\lambda_t] = m(t). The optimum checkpointing interval is the interval which minimizes the
above cost. By discretizing m(t), [Chandy 75a] gives a recursive algorithm to compute the instants at
which checkpoints must be done. The way in which the problem is stated is precisely the main
obstacle to obtaining a concise solution. Instead, assume that the system is started at time t_s due to a
failure or that a checkpoint finalized at time t_s. Then

    E[C^{[t_s,T_{ck}]}] = k' \int_{t_s}^{T_{ck}} C_R^{(t_s,\tau]} \lambda(\tau) d\tau + \int_{T_{ck}-C_d}^{T_{ck}} w(\tau) d\tau    (7.42)

Differentiating with respect to T_{ck}, the following equation is obtained.

    k' C_{R_0} + k' k \lambda(T_{ck}) \int_{t_s}^{T_{ck}} w(s) ds = w(T_{ck}) - w(T_{ck} - C_d)    (7.43)

The difference between the value of T_{ck} satisfying (7.43) and the solution proposed by [Chandy 75a]
is that the value of T_{ck} satisfying (7.43) can be computed by the system "on the fly". The first term on
the left hand side of (7.43) is the fixed cost due to recovery from a crash. The second term is the variable
cost and increases as T_{ck} - t_s increases. The right hand side is the cost associated with the
checkpointing operation. Thus, the above equation indicates that checkpointing must be performed
when the expected recovery cost exceeds the cost associated with checkpointing. Let the system be
sampling the values of w(t) and \lambda(t) at regular intervals \Delta t. Then,

    \int_{t_s}^{T_{ck}} w(s) ds = \sum_{n=1}^{N} w(t_n) \Delta t    (7.44)

where the first sample is taken immediately after a checkpoint has been performed or a crash has
occurred, and the N-th sample is taken at T_{ck}. The system has only to keep track of the variables

    C_R(t_n) = C_R(t_{n-1}) + k k' \lambda(t_n) w(t_n) \Delta t    (7.45)

    C_{ck}(t_n) = w(t_n) - w(t_n - C_d)    (7.46)

where

    C_R(t_0) = k' C_{R_0}    (7.47)

A checkpoint must be performed whenever C_R(t_n) > C_{ck}(t_n). In this way, the optimum checkpointing
interval adapts itself to system behavior, by resetting the time scale every time that a checkpoint or a
crash occurs.
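
The running rule just described needs very little state, which the following Python sketch tries to make concrete. It accumulates the recovery cost by adding k k' lambda(t_n) w(t_n) Delta t at each sample, starting from k' C_R0, and compares it with the checkpoint cost w(t_n) - w(t_n - C_d). The class name CheckpointMonitor, the use of a bounded history to look up the delayed workload sample, and the ramp-shaped test workload are assumptions of this example, not part of the original formulation.

```python
from collections import deque


class CheckpointMonitor:
    """Running form of the on-the-fly rule: checkpoint when C_R exceeds C_ck."""

    def __init__(self, c_r0, k, k_prime, c_d, dt):
        self.c_r0, self.k, self.k_prime, self.dt = c_r0, k, k_prime, dt
        lag = max(1, round(c_d / dt))           # number of samples spanning C_d
        self.history = deque(maxlen=lag + 1)    # recent workload samples
        self.reset()

    def reset(self):
        """Call after every checkpoint or crash to restart the time scale."""
        self.c_r = self.k_prime * self.c_r0

    def sample(self, w, lam):
        """Feed one sample of workload w and failure rate lam.
        Returns True when a checkpoint should be taken now."""
        self.history.append(w)
        self.c_r += self.k * self.k_prime * lam * w * self.dt
        # Oldest stored sample stands in for w(t_n - C_d); until the history
        # is full this is only an approximation.
        c_ck = w - self.history[0]
        return self.c_r > c_ck


# Hypothetical ramp workload with a constant failure rate.
monitor = CheckpointMonitor(c_r0=0.0, k=1.0, k_prime=1.0, c_d=2.0, dt=1.0)
decisions = [monitor.sample(float(n), 0.1) for n in range(8)]
print(decisions.index(True))                    # index of the first checkpoint
```

In a real system the caller would invoke reset() each time a checkpoint completes or the system restarts after a crash, so that the accumulated recovery cost starts again from its fixed part.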
7.4. Reliability modeling including transient hardware faults,
software faults, and permanent hardware faults


As a final elaboration of the present modeling methods, a model will be presented which includes
the effect of permanent hardware faults, in addition to hardware transients and software faults. The
modeled system is a nonredundant system under constant average workload. Recall from Chapters 5
and 6 that the Stationary model still gives a much better characterization than any other model,
even for systems under periodic workload. The assumptions regarding the statistics of the time to
permanent fault will be the traditional ones, i.e., the time to permanent failure will be assumed to be
exponentially distributed

    P_p(t_p < \tau) = 1 - e^{-\lambda_p \tau}    (7.48)

where P_p(t_p < \tau) is the probability that a permanent fault will occur before time \tau. The PDF of the time
to failure due to transients and software will be assumed to be any of the distributions given in
Chapter 4 under the constant workload assumption

    p_{t_f}(t) = h(t) e^{-\int_0^t h(s) ds}    (7.49)

where h(t) is any of the hazard functions given in Section 4.1.

7.4.1. Markov processes


Reliability modeling for permanent faults is often characterized by means of Markov processes.
Central to the theory of Markov processes are the concepts of state and state transition. The state of a
system represents all that needs to be known to describe the system at any instant. In the course of
time the system passes from state to state and therefore exhibits a dynamic behavior. If the system
can be characterized by its continuous time evolution through a discrete state space, at any instant
the system is in one of N states, and transitions between states occur at random times. The
distinguishing property of Markov processes is that they must satisfy the following property

    P(s_{t_n} = s_n | s_{t_1} = s_1, ..., s_{t_{n-1}} = s_{n-1}) = P(s_{t_n} = s_n | s_{t_{n-1}} = s_{n-1})    (7.50)

where s_{t_n} denotes the state occupied at time t_n. The above equality has the following implications:

• The probability of occupying any state in the future depends only on the state presently
occupied.

• The pdf of the time to the next transition does not depend on how long the present state
has been occupied nor on the destination state.
For continuous time Markov processes, the above property in fact implies that the time to transition
must be exponentially distributed [Howard 71]. A stationary Markov process is then completely
specified by its transition probability matrix

    P = \{ p_{ij}; i,j = 1, ..., N \}    (7.51)

where

    p_{ij} = P\{ next state is j | present state is i \}    (7.52)

An equivalent characterization of a Markov process is in terms of the transition rates matrix \Lambda

    \Lambda = \{ \lambda_{ij}; i,j = 1, ..., N \}    (7.53)

where

    q_{ik}(t) = \lambda_{ik} e^{-\lambda_{ik} t}    (7.54)

is the pdf of the time to transition to state k given that the process enters state i at time 0.


Figure 7-6: Characterization of the reliability of a nonredundant system
subject to permanent hardware faults by a Markov process. \lambda_p is the rate at
which permanent failures occur and \lambda_r is the rate at which repairs take place.
(States: 1: Operational; 2: Failed.)

Figure 7-6 summarizes the characterization of a nonredundant system subject to permanent
hardware faults only. Since the system can be only operational or failed, the failure process is
characterized as a 2 state Markov process. The times to failure and to repair are exponentially
distributed. The MTTF and MTTR values are 1/\lambda_p and 1/\lambda_r respectively.
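
As a small illustration of the two state model, the long run fraction of time spent in the operational state can be estimated by drawing alternating exponential up and down times, and compared with the standard value MTTF / (MTTF + MTTR). The Python sketch below does this with the standard library only; the function names, the seed, and the numerical MTTF and MTTR values are hypothetical.

```python
import random


def simulate_availability(mttf, mttr, cycles=20000, seed=1):
    """Estimate the fraction of time in the operational state by sampling
    exponential times to failure (mean mttf) and to repair (mean mttr)."""
    rng = random.Random(seed)
    up = sum(rng.expovariate(1.0 / mttf) for _ in range(cycles))
    down = sum(rng.expovariate(1.0 / mttr) for _ in range(cycles))
    return up / (up + down)


def steady_state_availability(mttf, mttr):
    """Limiting probability of the operational state of the 2 state process."""
    return mttf / (mttf + mttr)


# Hypothetical values in hours.
estimate = simulate_availability(mttf=100.0, mttr=5.0)
exact = steady_state_availability(mttf=100.0, mttr=5.0)
print(estimate, exact)
```

The agreement between the two numbers relies on both holding times being exponential, which is exactly the restriction that the next section relaxes.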
7.4.2. Semi-Markov processes

Markov processes are not appropriate to characterize the reliability of systems subject to
permanent, transient, and software failures. The PDF of the time to failure due to transients or
software has been shown to have a decreasing hazard function and, according to the statistical tests
performed in Chapter 6, it is not properly described by an exponential distribution. Hence, for a system
subject to the three types of failures the PDF of the time to transition depends on the destination state.
This PDF will be exponential if the destination state is failed due to a permanent failure, or it will be of
the form given in (7.49) if the destination state is failed due to a transient or software error.

This dependency of the pdf of the time to transition on the destination state is precisely the
distinguishing property of the so called Semi-Markov processes. A system characterized by a
Semi-Markov process is always in one of N states. Successive state occupancies are governed by the
transition probabilities of a Markov process. At transition instants, the system behaves as a Markov
process, and the process determining such transitions is called the embedded Markov
process. The embedded Markov process is completely described by an N×N matrix of transition
probabilities P as defined in (7.51). In addition, in a Semi-Markov process, whenever the system
enters state i it is imagined that it determines the next state j according to state i's transition
probabilities {p_{i1}, ..., p_{iN}}. After j has been chosen, the system "holds" for a random time \tau_{ij} in state
i. The pdf of \tau_{ij} is given by q_{ij}(t), obtaining therefore a vector of pdf's for each state i. Hence, a
Semi-Markov process is completely determined only if both the matrix P and the pdf's matrix Q(t)

    Q(t) = \{ q_{ij}(t); i,j = 1, ..., N \}    (7.55)

are available.

Figure 7-7 synthesizes how a non redundant computing system can be characterized by a
Semi-Markov process incorporating the effects of permanent hardware failures, transient hardware failures,
and software failures. The system is operational when in state 1. The system then selects the next
state according to the transition probabilities $p_{12}$, $p_{13}$. If the destination state is state 2 (failed due to
transients or software), the system selects

$$q_{12}(t) = h(t)\, e^{-\int_0^t h(s)\, ds} \qquad (7.56)$$

as the pdf of the time to transition. If the next state is 3 (failed due to a permanent hardware failure), an
exponential distribution with parameter $\lambda_p$ is selected as the pdf of the time to transition

$$q_{13}(t) = \lambda_p e^{-\lambda_p t} \qquad (7.57)$$

[Figure 7-7: Characterization of a non redundant system subject to permanent and transient hardware failures, and software failures. States: 1 (O), Operational; 2 (F_t), Failed due to transients or software; 3 (F_p), Failed due to permanent failures.]

The other transitions are similarly characterized. If the system is in state 2 (failed due to transients or
software) it will become operational after a fixed recovery time $t_r$, and therefore the pdf of the time to
restart is $q_{21}(t) = \delta(t - t_r)$. A permanent hardware failure may also occur while the system is recovering
from a transient. The pdf of the time to such an event is an exponential distribution truncated at $t = t_r$. If
the system is in state 3 (failed due to a permanent hardware failure) it will always recover after a
random time exponentially distributed with parameter $\lambda_r$, and

$$q_{31}(t) = \lambda_r e^{-\lambda_r t} \qquad (7.58)$$

Note that $\lambda_r$ is not the rate at which permanent failures occur, but the rate at which permanent
failures are observed since the last system restart.

7.4.2.1. Limiting behavior

Define now the following matrix of time varying functions

$$\Phi(t) = \{ \varphi_{ij}(t);\ i,j = 1,\dots,N \} \qquad (7.59)$$

where $\varphi_{ij}(t)$ is the probability that the system will occupy state j at time t given that it entered state i at
time 0. Then it can be proven that

$$\varphi_{ij}(t) = \delta_{ij} \int_t^{\infty} q_i(\tau)\, d\tau + \sum_{k=1}^{N} p_{ik} \int_0^t q_{ik}(\tau)\, \varphi_{kj}(t-\tau)\, d\tau \qquad (7.60)$$

Equation (7.60) requires the solution of a system of integral equations whenever numerical values of
$\Phi(t)$ are required. Although this system of equations can sometimes be solved by using Laplace
transform methods (see [Howard 71]), doing so is cumbersome. However, if only the average behavior
is of interest, a simpler result can be used. Let

$$\varphi_{ij} = \lim_{t \to \infty} \varphi_{ij}(t) \qquad (7.61)$$

Then $\varphi_{ij}$ is the average fraction of time spent in state j given that the system entered state i at time 0.
For example, the value of $\varphi_{11}$ for the system described in Figure 7-7 is the system availability, since it
is equal to the fraction of time that the system is operational given that the system was first started at
time 0. A basic result of Semi-Markov theory is


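$$\varphi_{ij} = \varphi_j \qquad (7.62)$$

$$\varphi_j = \frac{\pi_j\, E[\tau_j]}{\sum_{k=1}^{N} \pi_k\, E[\tau_k]} \qquad (7.63)$$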


where $\pi = (\pi_1,\dots,\pi_N)$ is the limiting state probability vector of the embedded Markov process. Such a
vector can be obtained by solving the system of equations

$$\pi = \pi P \qquad (7.64)$$

subject to the condition

$$\sum_{i=1}^{N} \pi_i = 1 \qquad (7.65)$$

Equation (7.63) is important in that it implies that the only statistic of the holding times that affects the
limiting behavior of state occupancies is the expected value.

As a simple example consider the CMU-10A. Repair takes place with a frequency smaller than once
a month. Since the system crashes on the average every 9 hours,

$$p_{12} = 0.999 \qquad (7.66)$$

$$p_{13} = 0.001 \qquad (7.67)$$

Also, it will be assumed that

$$p_{21} = 1 \qquad (7.68)$$

$$p_{23} = 0 \qquad (7.69)$$

$$p_{31} = 1 \qquad (7.70)$$

$$p_{32} = 0 \qquad (7.71)$$

The following values are then obtained for $\pi$

$$\pi_1 = 0.5 \qquad (7.72)$$

$$\pi_2 = 0.4995 \qquad (7.73)$$

$$\pi_3 = 0.0005 \qquad (7.74)$$

Assuming

$$E[\tau_1] = 9 \text{ hours (Mean Time To Failure)} \qquad (7.75)$$

$$E[\tau_2] = 15 \text{ minutes (Mean time to recover from transients)} \qquad (7.76)$$

$$E[\tau_3] = 2 \text{ hours (Mean Time To Repair)} \qquad (7.77)$$

then

$$\text{Availability} = \varphi_{11} \approx 0.97 \qquad (7.78)$$
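
The arithmetic of this example can be checked numerically. The following Python sketch solves (7.64) and (7.65) for the embedded chain and then applies (7.63); it uses only the values quoted above. The helper names (`solve_linear`, `embedded_limiting_probabilities`, `limiting_occupancy`) are hypothetical and introduced here for illustration. A direct linear solve is used rather than repeated multiplication by P, since the embedded chain of this example alternates between state 1 and the failed states and a power iteration would not settle.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def embedded_limiting_probabilities(P):
    """Solve pi = pi P subject to sum(pi) = 1, as in (7.64) and (7.65)."""
    n = len(P)
    # Equation j (for j < n-1): sum_i pi_i * P[i][j] - pi_j = 0
    A = [[P[i][j] - (1.0 if i == j else 0.0) for i in range(n)]
         for j in range(n - 1)]
    # The last (redundant) balance equation is replaced by the normalization.
    A.append([1.0] * n)
    b = [0.0] * (n - 1) + [1.0]
    return solve_linear(A, b)

def limiting_occupancy(pi, mean_holding):
    """Apply (7.63): weight each pi_j by the mean holding time of state j."""
    weights = [p * m for p, m in zip(pi, mean_holding)]
    total = sum(weights)
    return [w / total for w in weights]

# Transition probabilities (7.66)-(7.71); states are 1, 2, 3 in that order.
P = [[0.0, 0.999, 0.001],
     [1.0, 0.0,   0.0],
     [1.0, 0.0,   0.0]]
pi = embedded_limiting_probabilities(P)
# Mean holding times (7.75)-(7.77), all expressed in hours.
phi = limiting_occupancy(pi, [9.0, 0.25, 2.0])
print("pi =", [round(x, 4) for x in pi])
print("availability =", round(phi[0], 3))
```

Only the mean holding times enter the computation, which is the point made after (7.65): the shape of each holding time distribution does not affect the limiting occupancies.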

7.4.2.2. Reliability prediction

Let the pdf of the time to failure, $p_{tf}(t)$, be the unconditional pdf of the time to transition from state 1,
independently of the destination state. Thus,

$$p_{tf}(t) = p_{12}\, q_{12}(t) + p_{13}\, q_{13}(t) \qquad (7.79)$$


and the reliability function becomes

$$R(t) = P(t_f > t) \qquad (7.80)$$

$$= \int_t^{\infty} p_{tf}(s)\, ds \qquad (7.81)$$

$$= p_{12}\, R_{12}(t) + p_{13}\, R_{13}(t) \qquad (7.82)$$

where $R_{12}(t)$ is the reliability function due to transients and software

$$R_{12}(t) = e^{-\int_0^t h(s)\, ds} \qquad (7.83)$$

and $R_{13}(t)$ is the reliability function due to permanent failures

$$R_{13}(t) = e^{-\lambda_p t} \qquad (7.84)$$

Since $p_{12} \gg p_{13}$, the system reliability is essentially the reliability function discussed in Chapter 6 for
transients and software according to the Stationary model.


7.5.  Summary 

The  modeling  methodology  introduced  in  Chapters  3  and  4  has  been  extended  in  this  chapter  to 
derive  some  important  applications  to  Reliability  modeling  and  cost/benefit  analysis  of  fault- 
tolerance.  In  particular,  the  following  extensions  have  been  considered: 

•  Decomposition of the failure process into its software and hardware components. Although
on the average the probability that a crash is due to software may be 0.6, the impact of
unreliable software may be much more important because the system crashes
more often in periods of high load, when the contribution of incorrect software is larger
than average.

•  Evaluation  of  the  added  cost  due  to  unreliability  both  from  a  user's  viewpoint  and  from  a 
system  manager’s  viewpoint.  Curiously,  the  fact  of  having  a  periodic  workload  (as 
opposed  to  constant)  has  an  associated  cost  in  itself. 

•  Study  of  previous  results  to  evaluate  the  optimum  checkpointing  interval.  A  new  result 
has  been  presented  in  the  case  of  periodic  workload. 

•  Introduction  to  models  incorporating  the  effects  of  software  errors,  transient  hardware 
faults,  and  permanent  hardware  faults. 



Chapter  8 

Conclusions  and  suggestions  for  further  research 

Throughout the thesis, the two main questions for which a simultaneous answer has been sought are

Question 1: What is it desirable to know about computing systems reliability?

Question 2: What variables can be easily measured from real systems?

If a simultaneous answer for both questions exists, it must obviously be a compromise, since the answer
to the first question is "everything". And this will never be known. Perhaps the closest answer to the
above two questions is given by the results presented in Chapter 7, where methods to evaluate the impact of
unreliability, and methods to trace the impact of each cause of unreliability (permanent hardware
failures, transient hardware failures, and software failures) have been presented.

It was claimed in Chapter 2 that an apparent conflict would be resolved. The fact that a
system fails more during prime time is widely accepted. Yet no statistical test can contradict the fact
that the Weibull distribution is a better distribution to characterize the time to failure
than an Exponential or Periodic model, even though the Weibull model does not include periodicity
concepts. The answer to this apparent conflict seems to be to consider the failure rate to be a
Gaussian process with periodic statistics, i.e., a cyclostationary process. The consequences of this
approach have been

•  Derivation of the general properties of the class of Doubly Stochastic Poisson processes
whose failure rate is a Gaussian process (Chapter 3).

•  Characterization of doubly stochastic Poisson processes whose intensity is either a
stationary or a cyclostationary Gaussian process. In particular, a complete family of
distributions commonly used in statistical analysis of failure data has been shown to be
a special case of this approach (Chapter 4). As a side effect, the general properties of the
unreliable behavior of computing systems operating under periodic or constant workload
have been established.


•  Elaboration  of  the  necessary  techniques  for  model  parameterization  and  validation 
(Chapter  5). 
•  Validation of the model by comparison with the actual behavior of a real system, and
comparison of its predictions with the predictions of other models (Chapter 6). In
particular, establishment of ranges for which other models may lead to over optimistic or
over pessimistic expectations.

•  Establishment  of  cost  and  benefit  measures  of  fault-tolerance  derived  from  the  modeling 
methodology  (Chapter  7). 


Since the main results presented in Chapters 3, 4, and 7 are original, a cautious approach must be
taken  in  deriving  conclusions  from  these  results  until  further  proofs  of  their  validity  are  available.  With 
caution  and  the  two  above  questions  in  mind,  the  following  sections  summarize  the  preliminary 
conclusions  derived  from  this  thesis  and  pose  some  interesting  unanswered  questions.  Traditionally, 
these  new  questions  will  require  some  more  research  to  be  answered,  or  they  will  be  forgotten. 


8.1.  Reliability  modeling 

Throughout the thesis a reliability modeling methodology has been developed starting from basic
principles of operation of Time Sharing systems. Nevertheless, it should be noted that the original
MULTICS design dates from the early 1960's. Model validation has been done with the TOPS-10
operating system, already more than 10 years old. Why then bother to study such systems? Would it
not be better to study state of the art Time Sharing systems, multiprocessors, multicomputers, or
collections of personal computers operating in Local Area Networks?

The fact is that current systems still adopt the basic conventions of the original MULTICS design.
For example, the IBM 4341 processor executing the operating system VM/370 does not attempt to
recover from transient hardware failures if these failures occur while the system executes in kernel
mode [Ciacelly 81]. The system attempts to recover when transient failures occur in other modes,
which is one of the central working hypotheses of the present thesis. As for the extension of similar
modeling methods to other systems such as multiprocessors or multicomputers, note that the
methods followed in this thesis are oriented to steady state system characterization and rely heavily
on operating system measurement facilities such as error logs, system tables with accounting
information, and so on. These measuring tools are available in operating systems for purposes of
accounting, maintenance, or system tuning, but they have been
used here for reliability characterization purposes.


Few multiprocessors are available today for general use and experimentation. Of the
multiprocessors available, most are either experimental systems (such as C.mmp [Wulf 31] and Cm* [Jones
80]) that are far from being general purpose systems, or are dedicated to the execution of relatively simple
real time functions (such as Pluribus [Ornstein 74], whose only function is that of packet switching, or
other multiprocessors dedicated to telephone switching functions). Obviously, in most experimental
systems the concept of steady state operation is not defined and, since they are vehicles of experimentation,
their software is usually changing continuously. Further, no system available for experimentation has
the necessary measuring tools required to validate theoretical models. The emphasis given in
Tandem systems [Katzman 77] to instrumentation problems is significant [Blake 80], since Tandem
systems are at present the only multiprocessors offering high reliability in general purpose
applications. Further, before attacking the problem of characterizing the unreliable behavior of
multiprocessors due to hardware transients and software, it seems reasonable to solve first the
problem for simpler, more accessible systems such as Time Sharing computers.

Nevertheless, it is expected that several of the new results presented in this thesis will be applicable
to other systems. In Local Area Networks, expensive facilities such as centralized file systems or
expensive peripherals are likely to operate in Time Sharing mode, so their reliability can be
characterized by the same principles set out in this thesis.

The model presented in Section 7.4 incorporating permanent hardware failures, transient failures,
and software failures can be viewed as a first step in the characterization of the unreliable behavior of
multiprocessor systems. The extension of this model to multiprocessors is desirable but not at all an
easy task. First, note that model parameterization is possible only after detailed knowledge about the
relationships between resource usage and unreliable manifestations is available. Remember how the pdf of the
time to failure due to hardware transients and software has been derived. Secondly, the introduction
of redundancy in hardware, software, or both may lead to unexpected results since the failure
processes due to software and transients are not independent, but both depend on time varying
workload patterns. Thus, care is necessary when extending the model to systems with some degree of
redundancy.

A problem that has been systematically ignored throughout the thesis is the distinction between
transients and intermittent faults. While transient failures are manifestations of changing
environmental conditions (such as cosmic rays) or consequences of limitations in manufacturing
processes (such as the presence of radioactive materials in packaging materials), intermittent faults
are manifestations of physical degenerative processes (for instance, oxidation in a terminal contact).
Still, many of the results presented in this thesis should be valid, since an intermittent fault can only
be detected when exercising the component affected by the degenerative process. However, the
distinction between transients and intermittents is useful for diagnosis and replacement policies.
Current efforts in this direction have been reported by [Bossen 81].


Finally,  a  sensitivity  analysis  should  be  performed  establishing  levels  of  confidence  of  the  reliability 
predictions  of  the  Stationary  and  Cyclostationary  models  depending  on  the  parameter  estimation 
procedures  and  sample  size. 

In summary, the main topics in which further research is to be expected are:

•  Incorporation  in  state  of  the  art  systems  of  exhaustive  measuring  capabilities  to  allow 
system  characterization  and  model  validation  with  relatively  minor  effort. 

•  Extension  of  the  present  modeling  methods  (i.e.,  hardware/software  prediction  models) 
to  systems  having  some  degree  of  redundancy  at  the  subsystem  level  such  as 
multiprocessors. 

•  Better understanding of the differences in the manifestations of transients and
intermittent faults.

•  Sensitivity  analysis  of  reliability  predictions. 


8.2.  Performance/Reliability  modeling 

The above considerations are especially relevant to Performance/Reliability modeling techniques for
systems having some degree of redundancy such as multiprocessors. While in a (uniprocessor) Time
Sharing system singularities are easily identified (i.e., the hardware and the kernel of the operating
system), for a multiprocessor singularities may form a dynamically changing collection of resources.

Some Performance/Reliability models were referenced in Chapter 2. Most of those models assume
that upon failure detection the system may reconfigure itself and continue operating in a degraded
performance state until repair takes place. These models then attempt to characterize how system
performance is likely to evolve in time depending on the presence of different types of failures. The
main assumption common to all these existing models is that they all use Markov models as the
underlying abstraction. That means that all the times between state transitions are exponentially
distributed. However, it has been shown in this thesis that the distribution of the time to failure due to
transients and software cannot be approximated by an exponential distribution. Therefore,
Performance/Reliability models will have to evolve into Semi-Markov models where the distributions
of failures due to software and transients are of the type derived in Chapter 4.

If all that needs to be known about the system is its steady state behavior, the distinction between
Markov and Semi-Markov modeling is not important. As has been shown in Section 7.4, steady state
characterization for a Semi-Markov model depends only on the expected times between transitions.
However, if some more detailed knowledge is required, Semi-Markov modeling is unavoidable. Recall
from Section 6.1.2 that the differences in reliability predictions between the exponential distribution
and the distribution predicted by the Stationary approximation are not negligible for values of time
smaller than the MTTF value. Therefore, the distinction between Markov and Semi-Markov modeling is
especially relevant for systems having exceptional reliability requirements during periods of time
smaller than the expected MTTF value. This is the case, for example, of SIFT [Wensley 78] and FTMP
[Hopkins 78].
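
To illustrate how two time to failure distributions with the same MTTF can give different reliability predictions for times shorter than the MTTF, the following Python sketch compares an exponential distribution with a Weibull distribution of shape parameter smaller than one (a decreasing hazard function). The Weibull is used here only as a stand-in: it is not the distribution derived from the Stationary model, and the shape value `BETA = 0.5` is a hypothetical choice made for illustration. Only the 9 hour MTTF is taken from the example of Section 7.4.

```python
import math

MTTF = 9.0   # hours; the mean time to failure quoted for the example system
BETA = 0.5   # hypothetical Weibull shape < 1 (decreasing hazard); an assumption

def reliability_exponential(t, mttf=MTTF):
    """R(t) for an exponential time to failure with the given mean."""
    return math.exp(-t / mttf)

def reliability_weibull(t, mttf=MTTF, beta=BETA):
    """R(t) for a Weibull time to failure whose scale is chosen so that
    its mean equals mttf: mean = eta * Gamma(1 + 1/beta)."""
    eta = mttf / math.gamma(1.0 + 1.0 / beta)
    return math.exp(-((t / eta) ** beta))

for t in (0.5, 1.0, 3.0, 9.0):
    print(f"t = {t:4.1f} h   exponential R = {reliability_exponential(t):.3f}"
          f"   Weibull R = {reliability_weibull(t):.3f}")
```

With these assumed parameters the two curves share the same mean but differ visibly over the first few hours, which is the regime of interest for systems with short mission times.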


8.3. Software reliability evaluation and the design of reliable software

The central argument of this thesis with respect to software reliability is that the observed software
reliability depends on the instantaneous complexity of the data to be processed. Certainly, when a
software package is implemented it is expected to cope equally well in all situations for which it has
been designed to work. However, given that the software is subject to imperfections, it is more likely
that such imperfections will be noticed while processing data describing situations of high complexity
than while processing data describing simple situations. This is so because simpler situations are easier to
understand, and the software for them is easier to design and easier to debug.

This discussion is in rather loose terms because of the lack of a suitable descriptor for the meaning of
"complexity". But note that here complexity is an attribute of the world as seen by the software, not
an attribute of the software itself. However, the world seen by the software is just the state of its data
structures. If the only descriptors that can be obtained about the complexity of a situation to be
processed by the software are those given by the state of its data structures, such descriptors will be very
much representation dependent. By dependency on representation it is meant that different situations
with the same inherent complexity may lead to different software reliability characterizations
depending on the representation adopted in the data structures to represent such complexity.
Consider, for instance, the problem of deadlocks. Several processes request the allocation of several
resources. If some processes are competing for each other's resources but all are waiting and none
is able to release a resource, deadlock occurs. However, note that for a given number of processes, requests,
and resources (which together determine the complexity of the situation to be processed) deadlock
will occur only under certain orderings of requests, while sequences with different orderings may be
handled properly. The argument now is that on the average data describing more complex situations
will be harder to process, and therefore more prone to the manifestation of a software fault.
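
The dependence of deadlock on the ordering of requests, rather than on the number of processes and resources alone, can be seen in a minimal sketch. The Python fragment below is an illustration under simplifying assumptions that are not taken from the thesis: each process needs a fixed list of distinct resources, holds what it has acquired until it has all of them, and then releases everything at once. The function name `run_schedule` and the process and resource names are hypothetical.

```python
def run_schedule(needs, schedule):
    """Return True if the given ordering of requests ends in deadlock.

    needs[p]  -- distinct resources that process p must hold simultaneously,
                 listed in the order in which p requests them
    schedule  -- sequence of process ids; each entry lets that process
                 attempt its next request
    """
    owner = {}                          # resource -> process holding it
    progress = {p: 0 for p in needs}    # resources acquired so far by p
    done = set()

    def step(p):
        """Let p attempt its next request; return True if it advanced."""
        if p in done:
            return False
        resource = needs[p][progress[p]]
        if resource in owner:
            return False                # resource busy: p keeps waiting
        owner[resource] = p
        progress[p] += 1
        if progress[p] == len(needs[p]):
            # p now holds everything it needs: it finishes and releases all.
            for r in needs[p]:
                del owner[r]
            done.add(p)
        return True

    for p in schedule:
        step(p)
    # After the given ordering, let every process retry until nothing moves.
    while any(step(p) for p in needs):
        pass
    return len(done) < len(needs)

# Same processes, same resources, same requests; only the ordering differs.
needs = {"P1": ["A", "B"], "P2": ["B", "A"]}
sequential = run_schedule(needs, ["P1", "P1", "P2", "P2"])
interleaved = run_schedule(needs, ["P1", "P2", "P1", "P2"])
print("sequential ordering deadlocks:", sequential)
print("interleaved ordering deadlocks:", interleaved)
```

Both runs describe a situation of identical size (two processes, two resources, four requests), yet only the interleaved ordering leaves each process holding the resource the other is waiting for.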

This representation dependency of the complexity of the world seen by the software, and its
relevance to software reliability manifestations, has however some potential advantages. Consider a
piece of code that operates on certain data structures in such a way that the same code, fed with the
same input data, uses different internal representations in its data structures according to some
random factor. Then code replication to increase reliability makes sense.

Indeed, consider a software package operating over a variety of data structures such as lists,
queues, and arrays. Assume that the code has been written in such a way that data structure
initialization (and even perhaps allocation) is random. That is, no two initialization sequences lead to
the same representation of the same situation. This can be accomplished, for instance, by choosing
the header for the queues at random in a circular buffer. Consider now two copies of the same code
running in parallel in separate processors (or sequentially in the same processor). As the code is
fed with external input data, both copies will use different representations in their internal data
structures. Thus, in some cases, one copy may manifest a software fault due to a particular
representation, while the other copy may be able to handle the same situation without problem.
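
The idea of choosing the queue header at random in a circular buffer can be sketched as follows. This Python fragment is only an illustration of the representation argument and makes no claim about reliability gains; the class name `RandomHeadQueue`, the buffer capacity, and the seeds are assumptions introduced here.

```python
import random

class RandomHeadQueue:
    """FIFO queue stored in a circular buffer whose initial head index is
    chosen at random, so that the internal layout varies between copies."""

    def __init__(self, capacity, rng):
        self._buf = [None] * capacity
        self._head = rng.randrange(capacity)   # random initial header position
        self._size = 0

    def put(self, item):
        if self._size == len(self._buf):
            raise OverflowError("queue full")
        tail = (self._head + self._size) % len(self._buf)
        self._buf[tail] = item
        self._size += 1

    def get(self):
        if self._size == 0:
            raise IndexError("queue empty")
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        return item

    def layout(self):
        """Expose the internal representation (for inspection only)."""
        return list(self._buf)

# Two copies of the same code fed with the same external input.
copy_a = RandomHeadQueue(8, random.Random(1))
copy_b = RandomHeadQueue(8, random.Random(2))
for request in ("req1", "req2", "req3"):
    copy_a.put(request)
    copy_b.put(request)

print("layout of copy A:", copy_a.layout())
print("layout of copy B:", copy_b.layout())
out_a = [copy_a.get() for _ in range(3)]
out_b = [copy_b.get() for _ in range(3)]
print("same external behavior:", out_a == out_b)
```

The two copies may place the same requests in different buffer positions while returning them in the same order, so a fault tied to one particular internal layout (for example, one that only appears when the queue wraps around the end of the buffer) need not be triggered in both copies by the same input.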

The above arguments are highly speculative and their validation would require (at least) the design
of a complete experiment and background study, as has been done in the present thesis. However,
this potential approach to the design of reliable software has been presented here because it is a
natural extension of the methodology followed in this thesis.